Despite tabular data being the bread and butter of industry data science, data shifts are sometimes missed when analyzing model performance.
We’ve all been there: You develop a machine learning model, achieve great results on your validation set, then deploy it (or test it) on a new, real-world dataset. Suddenly, performance drops.
So, what’s the problem?
Normally, we point the finger at Covariate Shift. The distribution of features in the new data is different from the training data. We use this as a “Get Out of Jail Free” card: “The data changed, so naturally, the performance is lower. It’s the data’s fault, not the model’s.”
But what if we stopped using covariate shift as an excuse and began using it as a tool?
I believe there’s a better way to handle this and to create a “gold standard” for analyzing model performance. That method allows us to estimate performance accurately, even when the ground shifts beneath our feet.
The Problem: Comparing Apples to Oranges
Let’s look at a simple example from the medical world.
Imagine we trained a model on patients aged 40-89. However, in our new target test data, the age range is stricter: 50-80.
If we simply run the model on the test data and compare it to our original validation scores, we’re misleading ourselves. To compare “apples to apples,” a good data scientist would return to the validation set, filter for patients aged 50-80, and recalculate the baseline performance.
But let’s make it harder
Suppose our test dataset contains millions of records aged 50-80, and one single patient aged 40.
- Do we compare our results to the validation 40-80 range?
- Do we compare to the 50-80 range?
If we ignore the actual age distribution (which most traditional analyses do), that single 40-year-old patient theoretically shifts the definition of the cohort. In practice, we would just delete that outlier. But what if there were 100 or 1,000 patients aged below 50? Can we do better? Can we automate this process to handle differences in multiple variables simultaneously, without manually filtering data? Moreover, filtering data isn’t a great solution: it only matches the range itself but ignores the distribution shift within that range.
The Solution: Inverse Probability Weighting
The solution is to mathematically re-weight our validation data to look like the test data. Instead of binary inclusion/exclusion (keeping or dropping a row), we assign a continuous weight to each record in our validation set. It’s an extension of the simple filtering method above for matching the same age range (a minimal sketch follows the list below).
- Weight = 1: Standard evaluation.
- Weight = 0: Exclude the record (filtering).
- Weight is any non-negative float: scale the record’s influence down or up.
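Here is a minimal sketch of that idea (toy numbers, with accuracy as the assumed metric): 0/1 weights reproduce plain filtering, while general IPW weights simply extend it.

import numpy as np

# Hypothetical toy numbers: per-record correctness of our model on six validation patients.
correct = np.array([1, 0, 1, 1, 0, 1], dtype=float)
ages = np.array([42, 45, 55, 63, 71, 78])

# Filtering to ages 50-80 is just the special case of 0/1 weights...
filter_weights = ((ages >= 50) & (ages <= 80)).astype(float)
print(np.average(correct, weights=filter_weights))   # accuracy on the filtered subset
print(correct[(ages >= 50) & (ages <= 80)].mean())   # identical result

# ...while IPW allows any non-negative weight per record.
ipw_weights = np.array([0.1, 0.1, 1.4, 2.0, 0.8, 1.1])
print(np.average(correct, weights=ipw_weights))      # weighted accuracy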
The Intuition
In our example (Test: Age 50-80 + one 40-year-old), the solution is to mimic the test cohort inside our validation set. We want our validation set to “pretend” it has the exact same age distribution as the test set.
The Math
Let’s formalize this. We need to define two probabilities:
- Pt(x): The probability of seeing feature value x (e.g., Age) in the Target Test data.
- Pv(x): The probability of seeing feature value x in the Validation data.
The weight w for any given record with feature x is the ratio of these probabilities:
w(x) := Pt(x) / Pv(x)
This is intuitive. If 60-year-olds are rare in the validation data (Pv is low) but common in production (Pt is high), the ratio is large, and we weight these records up in our evaluation to match reality. Conversely, in our first example, where the test set is strictly aged 50-80, any validation patients outside this range receive a weight of 0 (since Pt(Age) = 0). That is effectively the same as excluding them, exactly as needed.
This is a statistical technique known as Importance Sampling or Inverse Probability Weighting (IPW).
By applying these weights when calculating metrics (like Accuracy, AUC, or RMSE) on your validation set, you create a synthetic cohort that matches the test domain. You can now compare apples to apples without complaining about the shift.
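As a hedged illustration (the labels, scores, and weights below are synthetic stand-ins), most scikit-learn metrics accept a sample_weight argument, which is all that’s needed to score the re-weighted cohort:

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# Synthetic stand-ins: true labels, model scores, and one IPW weight per validation record.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 1000)
scores = np.clip(0.6 * y_val + rng.normal(0.3, 0.2, 1000), 0, 1)
ipw_weights = rng.uniform(0.2, 3.0, 1000)

# Weighted metrics on the re-weighted ("synthetic cohort") validation set.
print(roc_auc_score(y_val, scores, sample_weight=ipw_weights))
print(accuracy_score(y_val, scores > 0.5, sample_weight=ipw_weights))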
The Extension: Handling High-Dimensional Shifts
Doing this for one variable (Age) is simple: you can just use histograms/bins. But what if the data shifts across dozens of variables simultaneously? We cannot build a dozen-dimensional histogram. The solution is a clever trick using a binary classifier.
We train a new model (a “Propensity Model,” let’s call it Mp) to distinguish between the two datasets.
- Input: The features of the record (Age, BMI, Blood Pressure, etc.), or whichever variables we want to control for.
- Target: 0 if the record comes from the Validation set, 1 if it comes from the Test set.
If this model can easily tell the datasets apart (AUC > 0.5), there is a covariate shift. The AUC of Mp also serves as a diagnostic tool: it tells you how different your test data is from the validation set, and how important it was to account for the shift. Crucially, the probabilistic output of this model gives us exactly what we need to calculate the weights.
Using Bayes’ theorem, the weight for a sample x becomes the odds that the sample belongs to the test set (a sketch of this computation follows the list below):
w(x) := Mp(x) / (1 - Mp(x))
- If Mp(x) ~ 0.5, the data points are indistinguishable, and the weight is 1.
- If Mp(x) -> 1, the model is very confident this record looks like Test data, and the weight increases.
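Here is a minimal sketch of that computation, assuming pandas DataFrames val_df and test_df with a shared list of feature columns; the classifier choice and the helper name propensity_weights are my own, not prescribed by the method.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def propensity_weights(val_df: pd.DataFrame, test_df: pd.DataFrame, features: list) -> np.ndarray:
    # Train Mp to separate Validation (label 0) from Test (label 1).
    X = pd.concat([val_df[features], test_df[features]], ignore_index=True)
    y = np.r_[np.zeros(len(val_df)), np.ones(len(test_df))]
    mp = GradientBoostingClassifier().fit(X, y)
    # Mp(x): probability that a validation record "looks like" the test set.
    p = mp.predict_proba(val_df[features])[:, 1]
    p = np.clip(p, 1e-6, 1 - 1e-6)   # guard against division by zero
    return p / (1 - p)               # w(x) = Mp(x) / (1 - Mp(x))

In practice, out-of-fold predictions (e.g., via cross_val_predict) and rescaling the weights to average 1 over the validation set are reasonable refinements, since an overfit Mp will exaggerate the weights.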
Note: Applying these weights doesn’t necessarily result in a drop in the expected performance. In some cases, the test distribution might shift toward subgroups where your model is actually more accurate. In that scenario, the method will scale up those instances, and your estimated performance will reflect that.
Does it work?
Yes, like magic. If you take your validation set, apply these weights, and then plot the distributions of your variables, they’ll perfectly overlay the distributions of your target test set.
It’s even more powerful than that: it aligns the joint distribution of all variables, not just their individual (marginal) distributions. When the propensity model is perfect, your weighted validation data becomes practically indistinguishable from the target test data.
This is a generalization of the single-variable case we saw earlier and yields the exact same result for a single variable. Intuitively, Mp learns the differences between our test and validation datasets; we then use this learned “understanding” to mathematically counter the difference.
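One way to sanity-check this claim on your own data (a sketch under the same assumed val_df/test_df setup, not part of the original write-up) is to re-fit a discriminator with the weights applied and confirm its AUC collapses toward 0.5; note that in-sample AUC is optimistic, so an out-of-fold estimate is preferable.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def residual_shift_auc(val_df, test_df, features, val_weights):
    # Validation rows carry their IPW weights; test rows keep weight 1.
    X = pd.concat([val_df[features], test_df[features]], ignore_index=True)
    y = np.r_[np.zeros(len(val_df)), np.ones(len(test_df))]
    w = np.r_[val_weights, np.ones(len(test_df))]
    clf = GradientBoostingClassifier().fit(X, y, sample_weight=w)
    # AUC near 0.5 means the weighted validation set is (nearly) indistinguishable from the test set.
    return roc_auc_score(y, clf.predict_proba(X)[:, 1], sample_weight=w)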
For example, you can look at the code snippet below, which generates two age distributions, one uniform (validation set) and one normal (target test set), and computes our weights.

Code Snippet
import pandas as pd
import numpy as np
import plotly.graph_objects as go

# Validation set: uniform ages 40-89; target test set: normal(65, 10), restricted to 40-89.
df = pd.DataFrame({"Age": np.random.randint(40, 90, 10000)})
df2 = pd.DataFrame({"Age": np.random.normal(65, 10, 10000)})
df2["Age"] = df2["Age"].round().astype(int)
df2 = df2[df2["Age"].between(40, 89)].reset_index(drop=True)
df3 = df.copy()

def get_fig(df: pd.DataFrame, title: str):
    # Build a per-age percentage bar trace, honoring existing weights if present.
    if "weight" not in df.columns:
        df["weight"] = 1
    age_count = df.groupby("Age")["weight"].sum().reset_index().sort_values("Age")
    tot = df["weight"].sum()
    age_count["Percentage"] = 100 * age_count["weight"] / tot
    f = go.Bar(x=age_count["Age"], y=age_count["Percentage"], name=title)
    return f, age_count

f1, age_count1 = get_fig(df, "ValidationSet")
f2, age_count2 = get_fig(df2, "TargetTestSet")

# Per-age weight: w(Age) = Pt(Age) / Pv(Age), estimated from the two histograms.
age_stats = age_count1[["Age", "Percentage"]].merge(
    age_count2[["Age", "Percentage"]].rename(columns={"Percentage": "Percentage2"}),
    on=["Age"],
)
age_stats["weight"] = age_stats["Percentage2"] / age_stats["Percentage"]

# Attach the weights to a copy of the validation set and re-plot it.
df3 = df3.merge(age_stats[["Age", "weight"]], on=["Age"])
f3, _ = get_fig(df3, "ValidationSet-Weighted")

fig = go.Figure(layout={"title": "Age Distribution"})
fig.add_trace(f1)
fig.add_trace(f2)
fig.add_trace(f3)
fig.update_xaxes(title_text="Age")
fig.update_yaxes(title_text="Percentage")
fig.show()
Limitations
While this is a powerful technique, it doesn’t always work. There are three main statistical limitations:
- Hidden Confounders: If the shift is caused by a variable you didn’t measure (e.g., a genetic marker you don’t have in your tabular data), you can’t weight for it. However, as model developers, we usually try to use the most predictive features in our model when possible.
- Ignorability (Lack of Overlap): You can’t divide by zero. If Pv(x) is zero (e.g., your validation data has no patients over 90, but the test set does), the weight explodes to infinity.
- The Fix: Identify these non-overlapping groups. If your validation set contains literally zero information about a particular sub-population, you must explicitly exclude that sub-population from the comparison and flag it as “unknown territory”.
- Propensity Model Quality: Since we rely on a model (Mp) to estimate weights, any inaccuracies or poor calibration in this model will introduce noise. For low-dimensional shifts (like a single ‘Age’ variable), this is negligible, but for high-dimensional, complex shifts, ensuring Mp is well calibrated is critical (a small calibration sketch follows below).
Although the propensity model isn’t perfect in practice, applying these weights significantly reduces the distribution shift. This provides a much more accurate proxy for real-world performance than doing nothing at all.
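If calibration is a concern, one option (a sketch, not a prescription; the estimator choice is arbitrary) is to wrap the propensity model in scikit-learn’s CalibratedClassifierCV before computing the odds-based weights:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

# Drop-in replacement for the plain classifier in the earlier sketch:
# isotonic calibration refines Mp's probabilities, and therefore the weights,
# at the cost of extra cross-validated fits.
calibrated_mp = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)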
A Note on Statistical Power
Remember that using weights changes your Effective Sample Size; high-variance weights reduce the stability of your estimates.
Bootstrapping: If you use bootstrapping, you’re safe as long as you incorporate the weights into the resampling process itself.
Power Calculations: Don’t use the raw number of rows (N). Refer to the Effective Sample Size formula (Kish’s ESS) to understand the true power of your weighted evaluation (see the short snippet below).
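For reference, Kish’s formula is ESS = (sum of w)^2 / (sum of w^2), which a few lines of numpy can compute; the lognormal weights below are only an illustration.

import numpy as np

def kish_ess(weights) -> float:
    # Kish's Effective Sample Size: equals N for uniform weights,
    # and shrinks as the weights become more variable.
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(0)
print(kish_ess(np.ones(10_000)))                   # 10000.0
print(kish_ess(rng.lognormal(0.0, 1.5, 10_000)))   # far fewer "effective" rows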
What about images and text?
The propensity model method works in those domains as well. However, from a practical perspective, the main issue is often ignorability: there is complete separation between our validation set and the target test set, which makes it impossible to counter the shift. That doesn’t mean our model will perform poorly on those datasets; it simply means we cannot estimate its performance based on our current validation set, which is completely different.
Summary
The best practice for evaluating model performance on tabular data is to strictly account for covariate shift. Instead of using shift as an excuse for poor performance, use Inverse Probability Weighting to estimate how your model should perform in the new environment.
This allows you to answer one of the toughest questions in deployment: “Is the performance drop due to the data changing, or is the model actually broken?”
If you use this method, you can explain the gap between training and production metrics.
If you found this helpful, let’s connect on LinkedIn
