XPER: Unveiling the Driving Forces of Predictive Performance

A new method for decomposing your favorite performance metrics

Photo by Sira Anamwong on 123RF

Co-authored with S. Hué, C. Hurlin, and C. Pérignon.

The trustworthiness and acceptability of sensitive AI systems largely depend on users' ability to understand the associated models, or at least their forecasts. To lift the veil on opaque AI applications, eXplainable AI (XAI) methods such as post-hoc interpretability tools (e.g. SHAP, LIME) are commonly used today, and the insights generated from their outputs are now widely understood.

Beyond individual forecasts, we show in this article how to identify the drivers of the performance metrics (e.g. AUC, R2) of any classification or regression model using the eXplainable PERformance (XPER) methodology. Being able to identify the driving forces of the statistical or economic performance of a predictive model lies at the very core of modeling and is of great importance for both data scientists and experts basing their decisions on such models. The XPER library outlined below has proven to be an efficient tool to decompose performance metrics into individual feature contributions.

While they are grounded in the same mathematical principles, XPER and SHAP are fundamentally different and simply have different goals. While SHAP pinpoints the features that significantly influence the model’s individual predictions, XPER identifies the features that contribute the most to the performance of the model. The latter analysis can be conducted at the global (model) level or the local (instance) level. In practice, the feature with the strongest impact on individual forecasts (say feature A) may not be the one with the strongest impact on performance. Indeed, feature A drives individual decisions when the model is correct but also when the model makes an error. Conceptually, if feature A mainly impacts erroneous predictions, it may rank lower with XPER than it does with SHAP.

What is a performance decomposition used for? First, it can enhance any post-hoc interpretability analysis by offering a more comprehensive insight into the model’s inner workings. This allows for a deeper understanding of why the model is, or is not, performing effectively. Second, XPER can help identify and address heterogeneity concerns. Indeed, by analyzing individual XPER values, it is possible to pinpoint subsamples in which the features have similar effects on performance. Then, one can estimate a separate model for each subsample to boost the predictive performance. Third, XPER can help to understand the origin of overfitting. Indeed, XPER allows us to identify features that contribute more to the performance of the model in the training sample than in the test sample.
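The overfitting diagnostic mentioned above can be sketched as a simple comparison of feature contributions across samples. This is only an illustration: the numbers are invented, the feature names are placeholders, and `contribution_gaps` is a hypothetical helper, not a function of the XPER library.

```python
# Hypothetical global XPER values computed on the training and test samples
# (illustrative numbers only, not produced by any real model or by the library).
xper_train = {"feature_a": 0.20, "feature_b": 0.12, "feature_c": 0.10}
xper_test = {"feature_a": 0.19, "feature_b": 0.11, "feature_c": 0.02}


def contribution_gaps(train, test):
    """Return (feature, train - test) pairs sorted from largest to smallest gap."""
    gaps = {name: train[name] - test[name] for name in train}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


gaps = contribution_gaps(xper_train, xper_test)
for name, gap in gaps:
    # A large positive gap means the feature helps much more in-sample than out-of-sample.
    print(f"{name}: train-test gap = {gap:+.3f}")
```

In this made-up example, feature_c would be the first candidate to inspect, since most of its training-sample contribution does not carry over to the test sample.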

The XPER framework is a theoretically grounded method that is based on Shapley values (Shapley, 1953), a decomposition method from coalitional game theory. While Shapley values decompose a payoff among players in a game, XPER values decompose a performance metric (e.g., AUC, R2) among features in a model.
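To make the analogy with coalitional games concrete, here is a minimal sketch of an exact Shapley decomposition in which the "players" are three features and the "payoff" is an AUC. The table of AUC values per coalition is made up for illustration, and `shapley_values` is a hypothetical helper rather than part of the XPER library; how XPER actually evaluates the performance when a feature is excluded is described in the paper [3].

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a value function defined on frozensets of players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding p to the coalition s
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical AUC obtained when only the listed features are informative
# (illustrative numbers, not produced by any real model).
auc_by_coalition = {
    frozenset(): 0.50,
    frozenset({"x1"}): 0.66,
    frozenset({"x2"}): 0.60,
    frozenset({"x3"}): 0.54,
    frozenset({"x1", "x2"}): 0.74,
    frozenset({"x1", "x3"}): 0.70,
    frozenset({"x2", "x3"}): 0.63,
    frozenset({"x1", "x2", "x3"}): 0.78,
}

features = ["x1", "x2", "x3"]
phi = shapley_values(features, auc_by_coalition.__getitem__)
benchmark = auc_by_coalition[frozenset()]

for name in features:
    print(f"{name}: {phi[name]:.4f}")
print(f"benchmark + contributions = {benchmark + sum(phi.values()):.2f}")
```

The contributions add up exactly to the gap between the full-model AUC and the benchmark, which is the efficiency property that makes Shapley values suitable for decomposing a metric. The brute-force enumeration grows exponentially with the number of features, so it is only practical for small toy cases like this one.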

Suppose that we train a classification model using three features and that its predictive performance is measured with an AUC equal to 0.78. An XPER decomposition of this AUC takes the form AUC = 𝜙₀ + 𝜙₁ + 𝜙₂ + 𝜙₃, with a benchmark 𝜙₀ and one XPER value per feature.

The first XPER value 𝜙₀ is called the benchmark and represents the performance of the model if none of the three features provided any relevant information to predict the target variable. When the AUC is used to evaluate the predictive performance of a model, the value of the benchmark corresponds to a random classification. As the AUC of the model is greater than 0.50, it implies that at least one feature contains useful information to predict the target variable. The difference between the AUC of the model and the benchmark represents the contribution of features to the performance of the model, which can be decomposed with XPER values. In this example, the decomposition indicates that the first feature is the main driver of the predictive performance of the model as it explains half of the difference between the AUC of the model and a random classification (𝜙₁), followed by the second feature (𝜙₂) and the third one (𝜙₃). These results thus measure the global effect of each feature on the predictive performance of the model and allow us to rank them from the least important (the third feature) to the most important (the first feature).
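The arithmetic behind this example can be written out as follows. Only the quantities stated in the text are used: the AUC of 0.78, the benchmark of 0.50, and the fact that the first feature explains half of the gap. The individual values of 𝜙₂ and 𝜙₃ are not given here, so only their sum and ordering are shown.

```latex
\begin{aligned}
\mathrm{AUC} &= \phi_0 + \phi_1 + \phi_2 + \phi_3 = 0.78, \qquad \phi_0 = 0.50, \\
\mathrm{AUC} - \phi_0 &= 0.78 - 0.50 = 0.28, \\
\phi_1 &= \tfrac{1}{2} \times 0.28 = 0.14, \\
\phi_2 + \phi_3 &= 0.28 - 0.14 = 0.14, \qquad \phi_1 > \phi_2 > \phi_3 .
\end{aligned}
```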

While the XPER framework can be used to conduct a global analysis of the model's predictive performance, it can also be used to provide a local analysis at the instance level. At the local level, the XPER value corresponds to the contribution of a given instance and feature to the predictive performance of the model. The benchmark then represents the contribution of a given observation to the predictive performance if the target variable was independent from the features, and the difference between the individual contribution and the benchmark is explained by individual XPER values. Therefore, individual XPER values allow us to understand why some observations contribute more to the predictive performance of a model than others, and can be used to address heterogeneity issues by identifying groups of individuals for which features have similar effects on performance.

It is also important to note that XPER is both model- and metric-agnostic. This implies that XPER values can be used to interpret the predictive performance of any econometric or machine learning model, and to break down any performance metric, such as predictive accuracy measures (AUC, accuracy), statistical loss functions (MSE, MAE), or economic performance measures (profit-and-loss functions).

01 — Download Library ⚙️

The XPER approach is implemented in Python through the XPER library. To compute XPER values, one first has to install the XPER library as follows:

pip install XPER

02 — Import Library 📦

import XPER
import pandas as pd

03 — Load example dataset 💽

To illustrate how to use XPER values in Python, let us take a concrete example. Consider a classification problem whose main objective is to predict credit default. The dataset can be imported directly from the XPER library as follows:

import XPER
from XPER.datasets.load_data import loan_status
loan = loan_status().iloc[:, :6]

display(loan.head())
display(loan.shape)

The primary goal of this dataset, given the included variables, appears to be building a predictive model to determine the “Loan_Status” of a potential borrower. In other words, we want to predict whether a loan application will be approved (“1”) or not (“0”) based on the information provided by the applicant.

# Drop the 'Loan_Status' column from the 'loan' dataframe and keep the remaining features in 'X'
X = loan.drop(columns='Loan_Status')

# Store the 'Loan_Status' column of the 'loan' dataframe in the Series 'Y' (the target)
Y = pd.Series(loan['Loan_Status'])

04 — Estimate the Model ⚙️

Then, we need to train a predictive model and measure its performance in order to compute the associated XPER values. For illustration purposes, we split the initial dataset into a training and a test set and fit an XGBoost classifier on the training set:

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
# X: input features
# Y: target variable
# test_size: the proportion of the dataset to include in the testing set (in this case, 15%)
# random_state: the seed value used by the random number generator for reproducible results
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=3)

import xgboost as xgb

# Create an XGBoost classifier object
gridXGBOOST = xgb.XGBClassifier(eval_metric="error")

# Train the XGBoost classifier on the training data
model = gridXGBOOST.fit(X_train, y_train)

05 — Evaluate Performance 🎯

The XPER library offers an intuitive and simple way to compute the predictive performance of a predictive model. Considering that the performance metric of interest is the Area Under the ROC Curve (AUC), it can be measured on the test set as follows:

from XPER.compute.Performance import ModelPerformance

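# Build the performance evaluator from the train/test arrays and the fitted model
# (note: from here on, the name XPER refers to this object and shadows the imported module)
XPER = ModelPerformance(X_train.values,
                        y_train.values,
                        X_test.values,
                        y_test.values,
                        model)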

# Evaluate the model performance using the desired metric(s)
PM = XPER.evaluate(["AUC"])

# Print the performance metrics
print("Performance Metrics: ", round(PM, 3))
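As a reminder of what the AUC measures, the sketch below computes it by hand on a toy example as the share of (positive, negative) pairs in which the positive observation receives the higher score. The function `pairwise_auc` and the toy data are illustrative assumptions, independent of the XPER library and of the loan dataset.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores (made up for illustration)
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

auc_toy = pairwise_auc(y_toy, scores_toy)
print(auc_toy)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

A model whose scores carry no information about the labels orders about half of the pairs correctly, which is why 0.5 serves as the benchmark for the AUC.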

06 — Calculate XPER values ⭐️

Finally, to explain the driving forces of the AUC, the XPER values can be computed as follows:

# Calculate XPER values for the model's performance
XPER_values = XPER.calculate_XPER_values(["AUC"],kernel=False)

The variable XPER_values is a tuple of two elements: the XPER values and the individual XPER values of the features.

For use cases with more than 10 feature variables, it is recommended to use the default option kernel=True for computational efficiency.

07 — Visualization 📊

from XPER.viz.Visualisation import visualizationClass as viz

labels = list(loan.drop(columns='Loan_Status').columns)

To analyze the driving forces at the global level, the XPER library provides a bar plot representation of XPER values.

viz.bar_plot(XPER_values=XPER_values, X_test=X_test, labels=labels, p=5,percentage=True)

For ease of presentation, feature contributions are expressed as a percentage of the spread between the AUC and its benchmark, i.e., 0.5 for the AUC, and are ordered from the largest to the smallest. From this figure, we can see that more than 78% of the over-performance of the model over a random predictor comes from Credit History, followed by Applicant Income contributing around 16% to the performance, and Co-applicant Income and Loan Amount Term each accounting for less than 6%. On the other hand, we can see that the variable Loan Amount barely helps the model to better predict the probability of default, as its contribution is close to 0.
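The percentage scale used in the bar plot can be reproduced by hand with a few lines. In this sketch, `percentage_shares` is a hypothetical helper and the contributions are illustrative numbers for a three-feature model with an AUC of 0.78; they are not the values of the loan example, and this is not the library's own implementation.

```python
def percentage_shares(benchmark, contributions, metric_value):
    """Express each contribution as a percentage of (metric_value - benchmark)."""
    spread = metric_value - benchmark
    if spread == 0:
        raise ValueError("metric equals its benchmark; shares are undefined")
    return {name: 100.0 * value / spread for name, value in contributions.items()}


# Hypothetical global XPER values (illustrative only)
contributions = {"x1": 0.14, "x2": 0.10, "x3": 0.04}

shares = percentage_shares(benchmark=0.50, contributions=contributions, metric_value=0.78)

# Order features from the largest to the smallest share, as in the bar plot
ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
for name, share in ranked:
    print(f"{name}: {share:.1f}%")
```

Because the contributions sum to the spread between the metric and its benchmark, the shares sum to 100%.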

The XPER library also provides graphical representations to analyze XPER values at the local level. First, a force plot can be used to analyze the driving forces of the performance for a given observation:

viz.force_plot(XPER_values=XPER_values, instance=1, X_test=X_test, variable_name=labels, figsize=(16,4))

The preceding code plots the positive (negative) XPER values of observation #10 in red (blue), as well as the benchmark (0.33) and contribution (0.46) of this observation to the AUC of the model. The over-performance of borrower #10 is due to the positive XPER values of Loan Amount Term, Applicant Income, and Credit History. On the other hand, Co-Applicant Income and Loan Amount had a negative effect and decreased the contribution of this borrower.

We can see that while Applicant Income and Loan Amount have a positive effect on the AUC at the global level, these variables have a negative effect for borrower #10. The analysis of individual XPER values can thus identify groups of observations for which features have different effects on performance, potentially highlighting a heterogeneity issue.

Second, it is possible to represent the XPER values of each observation and feature on a single plot. For that purpose, one can rely on a beeswarm plot, which represents the XPER values for each feature as a function of the feature value.

viz.beeswarn_plot(XPER_values=XPER_values,X_test=X_test,labels=labels)

In this figure, each dot represents an observation. The horizontal axis represents the contribution of each observation to the performance of the model, while the vertical axis represents the magnitude of feature values. Similarly to the bar plot shown previously, features are ordered from those that contribute the most to the performance of the model to those that contribute the least. However, with the beeswarm plot it is also possible to analyze the effect of feature values on XPER values. In this example, we can see that large values of Credit History are associated with relatively small contributions (in absolute value), whereas low values lead to larger contributions (in absolute value).

All images, unless otherwise stated, are by the author.

The contributors to this library are:

[1] L. Shapley, A Value for n-Person Games (1953), Contributions to the Theory of Games, 2:307–317

[2] S. Lundberg, S. Lee, A unified approach to interpreting model predictions (2017), Advances in Neural Information Processing Systems

[3] S. Hué, C. Hurlin, C. Pérignon, S. Saurin, Measuring the Driving Forces of Predictive Performance: Application to Credit Scoring (2023), HEC Paris Research Paper No. FIN-2022–1463
