When Shapley Values Break: A Guide to Robust Model Explainability


Explainability in AI is important for building trust in model predictions and essential for improving model robustness. Good explainability often acts as a debugging tool, revealing flaws in the model training process. While Shapley values have become the industry standard for this task, we must ask: Do they always work? And critically, where do they fail?

To understand where Shapley values fail, the best approach is to control the ground truth. We'll start with a simple linear model and then systematically break the explanation. By observing how Shapley values react to these controlled changes, we can identify exactly where they yield misleading results and how to fix them.

The Toy Model

We'll start with a linear model over 100 independent, uniformly distributed random features.

import numpy as np
from sklearn.linear_model import LinearRegression
import shap

def get_shapley_values_linear_independent_variables(
    weights: np.ndarray, data: np.ndarray
) -> np.ndarray:
    # For a linear model with independent features and a zero background,
    # the Shapley value of each feature reduces to weight * value.
    return weights * data

# To compare the theoretical results with the shap package
def get_shap(weights: np.ndarray, data: np.ndarray):
    model = LinearRegression()
    model.coef_ = weights  # Inject your weights
    model.intercept_ = 0
    background = np.zeros((1, weights.shape[0]))
    explainer = shap.LinearExplainer(model, background)  # Assumes independence between all features
    results = explainer.shap_values(data) 
    return results

DIM_SPACE = 100

np.random.seed(42)
# Generate random weights and data
weights = np.random.rand(DIM_SPACE)
data = np.random.rand(1, DIM_SPACE)

# Set specific values to check our intuition
# Feature 0: High weight (10), Feature 1: Zero weight
weights[0] = 10
weights[1] = 0
# Set maximal value for the primary two features
data[0, 0:2] = 1

shap_res = get_shapley_values_linear_independent_variables(weights, data)
shap_res_package = get_shap(weights, data)
idx_max = shap_res.argmax()
idx_min = shap_res.argmin()

print(
    f"Expected: idx_max 0, idx_min 1nActual: idx_max {idx_max},  idx_min: {idx_min}"
)

print(abs(shap_res_package - shap_res).max())  # No difference

In this simple example, where all variables are independent, the calculation simplifies dramatically.

Recall that the Shapley formula is based on the marginal contribution of each feature: the difference in the model's output when a variable is added to a coalition of known features versus when it is absent.

[ V(S∪{i}) – V(S) ]

Because the variables are independent, the specific combination of pre-selected features (S) does not influence the contribution of feature i. The effects of pre-selected and non-selected features cancel out during the subtraction, leaving no impact on the influence of feature i. Thus, the calculation reduces to measuring the marginal effect of feature i directly on the model output:

[ W_i · X_i ]

The result is both intuitive and works as expected. Because there is no interference from other features, the contribution depends solely on the feature's weight and its current value. Consequently, the feature with the largest combination of weight and value is the top contributor. In our case, feature index 0 has a weight of 10 and a value of 1.
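To make the reduction concrete, here is the one-line derivation for an arbitrary coalition S, using the zero background from the code above (an "absent" feature is simply set to 0):

[ V(S∪{i}) – V(S) = ( Σ_{j∈S} W_j · X_j + W_i · X_i ) – Σ_{j∈S} W_j · X_j = W_i · X_i ]

Because this marginal contribution is the same for every coalition S, the Shapley average over coalitions is simply W_i · X_i. With a non-zero background B, the same argument gives W_i · (X_i – B_i).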

Let’s Break Things

Now, we will introduce dependencies to see where Shapley values begin to fail.

In this scenario, we artificially induce perfect correlation by duplicating the most influential feature (index 0) 100 times. This results in a new model with 200 features, where the 100 added features are identical copies of our original top contributor and independent of the remaining 99 features. To complete the setup, we assign a zero weight to all of these added duplicate features. This ensures the model's predictions remain unchanged: we are only altering the structure of the input data, not the output. While this setup seems extreme, it mirrors a common real-world scenario: taking a known important signal and creating multiple derived features (such as rolling averages, lags, or mathematical transformations) to better capture its information.

However, because the original Feature 0 and its new copies are perfectly dependent, the Shapley calculation changes.

According to the Symmetry Axiom, if two features contribute equally to the model (in this case, by carrying the same information), they must receive equal credit.

Intuitively, knowing the value of any one clone reveals the full information of the group. As a result, the large contribution we previously saw for the single feature is now split equally across it and its 100 clones. The "signal" gets diluted, making the primary driver of the model appear much less important than it actually is.
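A quick back-of-the-envelope check with the numbers from the code below: Feature 0 originally contributes W_0 · X_0 = 10 · 1 = 10. Split equally across the 101 identical copies, each one receives only 10 / 101 ≈ 0.099, whereas a typical independent feature contributes about 0.25 on average (both its weight and its value are uniform on [0, 1]). Feature 0 therefore no longer surfaces as the top contributor.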
Here is the corresponding code:

import numpy as np
from sklearn.linear_model import LinearRegression
import shap

def get_shapley_values_linear_correlated(
    weights: np.ndarray, data: np.ndarray
) -> np.ndarray:
    res = weights * data
    duplicated_indices = np.array(
        [0] + list(range(data.shape[1] - DUPLICATE_FACTOR, data.shape[1]))
    )
    # Sum the contributions of the duplicated features and split the total equally among them
    full_contrib = np.sum(res[:, duplicated_indices], axis=1)
    duplicate_feature_factor = np.ones(data.shape[1])
    duplicate_feature_factor[duplicated_indices] = 1 / (DUPLICATE_FACTOR + 1)
    full_contrib = np.tile(full_contrib, (DUPLICATE_FACTOR+1, 1)).T
    res[:, duplicated_indices] = full_contrib
    res *= duplicate_feature_factor
    return res

def get_shap(weights: np.ndarray, data: np.ndarray):
    model = LinearRegression()
    model.coef_ = weights  # Inject your weights
    model.intercept_ = 0
    explainer = shap.LinearExplainer(model, data, feature_perturbation="correlation_dependent")    
    results = explainer.shap_values(data)
    return results

DIM_SPACE = 100
DUPLICATE_FACTOR = 100

np.random.seed(42)
weights = np.random.rand(DIM_SPACE)
weights[0] = 10
weights[1] = 0
data = np.random.rand(10000, DIM_SPACE)
data[0, 0:2] = 1

# Duplicate copy of feature 0, 100 times:
dup_data = np.tile(data[:, 0], (DUPLICATE_FACTOR, 1)).T
data = np.concatenate((data, dup_data), axis=1)
# Assign zero weight to all of the added duplicate features:
weights = np.concatenate((weights, np.zeros(DUPLICATE_FACTOR)))


shap_res = get_shapley_values_linear_correlated(weights, data)

shap_res = shap_res[0, :]  # Take the first record to check the results
idx_max = shap_res.argmax()
idx_min = shap_res.argmin()

print(f"Expected: idx_max 0, idx_min 1nActual: idx_max {idx_max},  idx_min: {idx_min}")

This is clearly not what we intended, and it fails to provide a good explanation of model behavior. Ideally, we want the explanation to reflect the ground truth: Feature 0 is the primary driver (with a weight of 10), while the duplicated features (indices 100–199) are merely redundant copies with zero weight. Instead of diluting the signal across all copies, we would clearly prefer an attribution that highlights the true source of the signal.

Note: If you run this with the Python shap package, you might notice the results are similar but not identical to our manual calculation. This is because computing exact Shapley values is computationally infeasible, so libraries like shap rely on approximation methods that introduce slight variance.
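If you want to reproduce that comparison, here is a minimal sketch that reuses the get_shap helper defined above (the correlation-aware explainer can be slow on 10,000 rows, and the exact numbers depend on its internal sampling):

# Correlation-aware attribution from the shap package.
shap_res_package = get_shap(weights, data)

# How the package spreads credit: original feature 0 vs. the average over its clones.
print(shap_res_package[0, 0], shap_res_package[0, DIM_SPACE:].mean())

# Gap between the package estimate and our manual equal-split calculation (first record).
print(np.abs(shap_res_package[0, :] - shap_res).max())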

Image by author (generated with Google Gemini).

Can We Fix This?

Since correlation and dependencies between features are extremely common, we cannot ignore this issue.

On the one hand, Shapley values do account for these dependencies. A feature with a coefficient of 0 in a linear model and no direct effect on the output receives a non-zero contribution because it contains information shared with other features. However, this behavior, driven by the Symmetry Axiom, is not always what we want for practical explainability. While "fairly" splitting the credit among correlated features is mathematically sound, it often hides the true drivers of the model.

Several techniques can handle this, and we will explore them below.

Grouping Features

This approach is especially critical for models with high-dimensional feature spaces, where feature correlation is inevitable. In these settings, attempting to attribute specific contributions to each variable is often noisy and computationally unstable. Instead, we can aggregate similar features that represent the same concept into a single group. A helpful analogy comes from image classification: if we want to explain why a model predicts "cat" instead of "dog", examining individual pixels is not meaningful. However, if we group pixels into "patches" (e.g., ears, tail), the explanation becomes immediately interpretable. Applying the same logic to tabular data, we can calculate the contribution of the group rather than splitting it arbitrarily among its components.

This can be achieved in two ways: by simply summing the Shapley values within each group, or by calculating the group's contribution directly. In the direct method, we treat the group as a single entity. Instead of toggling individual features, we treat the presence or absence of the group as the simultaneous presence or absence of all features within it. This reduces the dimensionality of the problem, making the estimation faster, more accurate, and more stable. A sketch of the summing variant follows.
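As a rough illustration, here is a minimal sketch of the summing variant that reuses the toy example above. The group_shap_by_summing helper and the group names are hypothetical, not part of any library:

import numpy as np

def group_shap_by_summing(shap_values: np.ndarray, groups: dict) -> dict:
    # Aggregate per-feature Shapley values into per-group attributions by summing.
    return {name: shap_values[:, idx].sum(axis=1) for name, idx in groups.items()}

# Hypothetical grouping for the toy example: the original feature 0 and its 100 clones
# form a single concept; every other feature stays in its own group.
groups = {"signal_concept": [0] + list(range(DIM_SPACE, DIM_SPACE + DUPLICATE_FACTOR))}
groups.update({f"feature_{i}": [i] for i in range(1, DIM_SPACE)})

grouped = group_shap_by_summing(
    get_shapley_values_linear_correlated(weights, data), groups
)
# On the first record, the full credit lands back on the concept instead of being diluted.
print(max(grouped, key=lambda name: grouped[name][0]))  # Expected: "signal_concept"

For this linear toy example, the direct variant (toggling the whole group on and off when building coalitions) yields the same group total; its advantage shows up in higher-dimensional settings, where it reduces the number of coalitions to evaluate.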

Image by author (generated with Google Gemini).

The Winner Takes It All

While grouping is effective, it has limitations. It requires defining the groups beforehand, and it often ignores correlations between those groups.

This leads to "explanation redundancy". Returning to our example, if the 101 cloned features are not pre-grouped, the output will repeat those 101 features with the same contribution 101 times. That is overwhelming, repetitive, and functionally useless. Effective explainability should reduce redundancy and show the user something new each time.

To achieve this, we can create a greedy iterative process. Instead of calculating all values at once, we select features step by step:

  1. Select the "Winner": Identify the single feature (or group) with the highest individual contribution.
  2. Condition the Next Step: Re-evaluate the remaining features, assuming the features from the previous steps are already known. We incorporate them into the subset of pre-selected features S in the Shapley calculation each time.
  3. Repeat: Ask the model: "Given that the user already knows about Features A, B, and C, which remaining feature contributes the most information?"

By recalculating Shapley values (or marginal contributions) conditioned on the pre-selected features, we ensure that redundant features effectively drop to zero. If Feature A and Feature B are identical and Feature A is selected first, Feature B no longer provides new information. It is automatically filtered out, leaving a clean, concise list of distinct drivers. A sketch of this loop is shown below.
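Here is a minimal sketch of that greedy loop for the linear toy example. It is not a full conditional Shapley recomputation: as a simplifying assumption, a feature's conditional contribution is approximated by how much knowing it moves a least-squares prediction of the model output, given the features already selected. The conditional_pred and greedy_explanation helpers are hypothetical:

import numpy as np

def conditional_pred(feature_idx: list, data: np.ndarray, output: np.ndarray, record: np.ndarray) -> float:
    # Best linear prediction of the model output for `record`, using only the given features.
    if len(feature_idx) == 0:
        return float(output.mean())
    X = np.column_stack([data[:, feature_idx], np.ones(len(data))])  # features + intercept
    beta, *_ = np.linalg.lstsq(X, output, rcond=None)
    return float(np.append(record[feature_idx], 1.0) @ beta)

def greedy_explanation(data: np.ndarray, output: np.ndarray, record: np.ndarray,
                       top_k: int = 3, tol: float = 1e-6) -> list:
    # At each step, pick the feature whose knowledge moves the conditional prediction
    # of this record the most, given the features already selected.
    selected = []
    for _ in range(top_k):
        base = conditional_pred(selected, data, output, record)
        gains = np.array([
            abs(conditional_pred(selected + [j], data, output, record) - base)
            if j not in selected else 0.0
            for j in range(data.shape[1])
        ])
        winner = int(gains.argmax())
        if gains[winner] < tol:
            break  # every remaining feature is redundant given what is already known
        selected.append(winner)
    return selected

output = data @ weights  # the linear model's predictions
# Feature 0 is picked first; once it is known, its 100 clones add nothing new and are never selected.
print(greedy_explanation(data, output, data[0], top_k=3))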

Image by author (generated with Google Gemini).

Note: You can find an implementation of both the direct group calculation and the greedy iterative calculation in our Python package medpython.
Full disclosure: I am a co-author of this open-source package.

Real World Validation

While this toy model demonstrates the mathematical pitfalls of the Shapley value method, how do these fixes hold up in real-life scenarios?

We applied the Grouped Shapley and Winner Takes It All methods, along with additional techniques (out of scope for this post, perhaps next time), in complex clinical settings used in healthcare. Our models rely on hundreds of strongly correlated features that were grouped into dozens of concepts.

The approach was validated across several models in a blinded setting, where our clinicians were not aware which method they were inspecting, and it outperformed vanilla Shapley values in their rankings. In the multi-step experiment, each technique improved on the previous one. Moreover, our team used these explainability enhancements as part of our submission to the CMS Health AI Challenge, where we were selected as award winners.

Image by the Centers for Medicare & Medicaid Services (CMS)

Conclusion

Shapley values are the gold standard for model explainability, providing a mathematically rigorous way to attribute credit.
However, as we've seen, mathematical "correctness" does not always translate into effective explainability.

When features are highly correlated, the signal can be diluted, hiding the true drivers of your model behind a wall of redundancy.

We explored two ways to repair this:

  1. Grouping: Aggregate correlated features into a single concept.
  2. Iterative Selection: Condition on already-presented concepts to extract only new information, effectively stripping away redundancy.

By acknowledging these limitations, we can ensure our explanations are meaningful and useful.

If you found this useful, let's connect on LinkedIn.
