This article was co-authored by Sebastian Humberg and Morris Stallmann.
Introduction
Machine learning (ML) models are designed to make accurate predictions based on patterns in historical data. But what if these patterns change overnight? For example, in bank card fraud detection, today’s legitimate transaction patterns might look suspicious tomorrow as criminals evolve their tactics and honest customers change their habits. Or picture an e-commerce recommender system: what worked for summer shoppers may suddenly flop as the winter holidays sweep in new trends. This subtle, yet relentless, shifting of the data, often called drift, can quietly erode your model’s performance, turning yesterday’s accurate predictions into today’s costly mistakes.
In this article, we’ll lay the foundation for understanding drift: what it is, why it matters, and how it can sneak up on even the best machine learning systems. We’ll break down the two main kinds of drift: data drift and concept drift. Then, we move from theory to practice by outlining robust frameworks and statistical tools for detecting drift before it derails your models. Finally, you’ll get a look into what to do against drift, so your machine learning systems remain resilient in a constantly evolving world.
What Is Drift?
Drift refers to unexpected changes in the data distribution over time, which can negatively impact the performance of predictive models. ML models solve prediction tasks by applying patterns that the model learned from historical data. More formally, in supervised ML, the model learns a joint distribution of some set of feature vectors X and target values y from all data available at time \(t_0\):
\[P_{t_0}(X, y) = P_{t_0}(X) \times P_{t_0}(y|X)\]
After training and deployment, the model will be applied to new data to make predictions, under the assumption that the new data follows the same joint distribution. However, if that assumption is violated, the model’s predictions may no longer be reliable, as the patterns in the training data may have become irrelevant. The violation of that assumption, namely the change of the joint distribution, is known as drift. Formally, we say drift has occurred if:
\[P_{t_0}(X, y) \ne P_{t}(X, y)\]
for some \(t > t_0\).
The Main Types of Drift: Data Drift and Concept Drift
Generally, drift occurs when the joint probability changes over time. But if we look more closely, we notice there are different sources of drift with different implications for the ML system. In this section, we introduce the notions of data drift and concept drift.
Recall that the joint probability can be decomposed as follows:
\[P(X, y) = P(X) \times P(y|X).\]
Depending on which part of the joint distribution changes, we speak of either data drift or concept drift.
Data Drift
If the distribution of the features changes, then we speak of data drift:
\[P_{t_0}(X) \ne P_{t}(X), \quad t > t_0.\]
Note that data drift does not necessarily mean that the relationship between the target values and the features has changed. Hence, it is possible that the machine learning model still performs reliably even after data drift has occurred.
In general, however, data drift often coincides with concept drift and can be a good early indicator of model performance degradation. Especially in scenarios where ground truth labels are not (immediately) available, detecting data drift can be an important component of a drift warning system. For instance, consider the COVID-19 pandemic, where the input data distribution of patients, such as their symptoms, changed for models attempting to predict clinical outcomes. The accompanying change in clinical outcomes was a concept drift that would only become observable after some time. To avoid incorrect treatment based on outdated model predictions, it is important to detect and signal the data drift that can be observed immediately.
Furthermore, drift can also occur in unsupervised ML systems where target values are not of interest at all. In such unsupervised systems, only data drift is defined.
Concept Drift
Concept drift is the change in the relationship between target values and features over time:
\[P_{t_0}(y|X) \ne P_{t}(y|X), \quad t > t_0.\]
Typically, model performance is negatively impacted when concept drift occurs.
In practice, the ground truth label often only becomes available with a delay (or not at all). Hence, observing \(P(y|X)\) may also only be possible with a delay. Therefore, in many scenarios, detecting concept drift in a timely and reliable manner can be far more involved or even impossible. In such cases, we may have to rely on data drift as an indicator of concept drift.
How Drift Can Evolve Over Time

Concept and data drift can take different forms, and these forms may have different implications for drift detection and drift handling strategies.
Drift may occur suddenly, with abrupt distribution changes. For example, purchasing behavior may change overnight with the introduction of a new product or promotion.
In other cases, drift may occur more gradually or incrementally over a longer period of time. For instance, if a digital platform introduces a new feature, this may affect user behavior on that platform: while at first only a few users adopt the new feature, more and more users may adopt it over time. Lastly, drift may be recurring and driven by seasonality. Imagine a clothing company: while in summer the company’s top-selling products may be T-shirts and shorts, those are unlikely to sell equally well in winter, when customers may be more interested in coats and other warmer clothing items.
How to Identify Drift

Before drift can be handled, it must be detected. To discuss drift detection effectively, we introduce a mental framework borrowed from the excellent paper “Learning under Concept Drift: A Review” (see reference list). A drift detection framework can be described in three stages:
- Data Collection and Modelling: The data retrieval logic specifies the data and time periods to be compared. Furthermore, the data is prepared for the next steps by applying a data model. This model could be a machine learning model, histograms, or even no model at all. We will see examples in subsequent sections.
- Test Statistic Calculation: The test statistic defines how we measure (dis)similarity between historical and new data, for example, by comparing model performance on historical and new data, or by measuring how different the data chunks’ histograms are.
- Hypothesis Testing: Finally, we apply a hypothesis test to decide whether the system should signal drift. We formulate a null hypothesis and a decision criterion (such as a p-value threshold). A minimal sketch of this three-stage loop follows below.
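To make the three stages concrete, here is a minimal sketch of such a framework in Python. The `DriftDetector` class and its components are illustrative placeholders, not taken from any specific library; concrete data models and test statistics are discussed in the sections below.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class DriftDetector:
    prepare: Callable[[np.ndarray], np.ndarray]                # stage 1: data model
    dissimilarity: Callable[[np.ndarray, np.ndarray], float]   # stage 2: test statistic
    threshold: float                                           # stage 3: decision criterion

    def signals_drift(self, reference: np.ndarray, new: np.ndarray) -> bool:
        # Prepare both data chunks, compute the dissimilarity, apply the decision rule.
        dis = self.dissimilarity(self.prepare(reference), self.prepare(new))
        return dis > self.threshold


# Example configuration: "no model at all" plus a difference-of-means statistic.
detector = DriftDetector(
    prepare=lambda x: x,
    dissimilarity=lambda ref, new: abs(ref.mean() - new.mean()),
    threshold=0.3,
)
rng = np.random.default_rng(0)
print(detector.signals_drift(rng.normal(0, 1, 1000), rng.normal(1, 1, 1000)))  # True
```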
Data Collection and Modelling
In this stage, we define exactly which chunks of data will be compared in subsequent steps. First, the time windows of our reference and comparison (i.e., new) data must be defined. The reference data could strictly be the historical training data (see figure below), or change over time as defined by a sliding window. Similarly, the comparison data can strictly be the latest batches of data, or it can extend the historical data over time, where both time windows can be sliding.
Once the data is available, it needs to be prepared for the test statistic calculation. Depending on the statistic, it might need to be fed through a machine learning model (e.g., when calculating performance metrics), transformed into histograms, or not be processed at all.
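As a sketch of this stage, the hypothetical helper below assumes the data lives in a pandas DataFrame with a datetime `timestamp` column, keeps the training period as a fixed reference window, and uses a sliding window over the most recent days as comparison data.

```python
import pandas as pd


def split_windows(df: pd.DataFrame, train_end: str, comparison_days: int = 7):
    """Return (reference, comparison) data chunks for the detection pipeline."""
    # Fixed reference window: everything up to the end of the training period.
    reference = df[df["timestamp"] <= pd.Timestamp(train_end)]
    # Sliding comparison window: the most recent `comparison_days` of data.
    cutoff = df["timestamp"].max() - pd.Timedelta(days=comparison_days)
    comparison = df[df["timestamp"] > cutoff]
    return reference, comparison
```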

Drift Detection Methods
One can identify drift by applying certain detection methods. These methods monitor the performance of a model (concept drift detection) or directly analyse incoming data (data drift detection). By applying various statistical tests or monitoring metrics, drift detection methods help to keep your model reliable. Whether through simple threshold-based approaches or advanced techniques, these methods help ensure the robustness and adaptivity of your machine learning system.
Observing Concept Drift Through Performance Metrics

The most direct way to spot concept drift (or its consequences) is by tracking the model’s performance over time. Given two time windows \([t_0, t_1]\) and \([t_2, t_3]\), we calculate the performance \(p_{[t_0, t_1]}\) and \(p_{[t_2, t_3]}\). Then, the test statistic can be defined as the difference (or dissimilarity) of performance:
\[dis = |p_{[t_0, t_1]} - p_{[t_2, t_3]}|.\]
Performance can be any metric of interest, such as accuracy, precision, recall, or F1-score (in classification tasks), or mean squared error, mean absolute percentage error, R-squared, etc. (in regression problems).
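As a sketch, assuming ground truth labels are available for both windows and `model` is any fitted scikit-learn-style classifier, the test statistic from above could be computed like this:

```python
from sklearn.metrics import accuracy_score


def performance_dissimilarity(model, X_ref, y_ref, X_new, y_new) -> float:
    """Absolute difference in accuracy between the reference and new window."""
    p_ref = accuracy_score(y_ref, model.predict(X_ref))
    p_new = accuracy_score(y_new, model.predict(X_new))
    return abs(p_ref - p_new)


# Signal drift if the gap exceeds a threshold chosen for the use case, e.g.:
# if performance_dissimilarity(model, X_ref, y_ref, X_new, y_new) > 0.05: ...
```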
Calculating performance metrics often requires ground truth labels that may only become available with a delay, or may never become available at all.
To detect drift in a timely manner even in such cases, proxy performance metrics can sometimes be derived. For instance, in a spam detection system, we may never know whether an email was actually spam or not, so we cannot calculate the accuracy of the model on live data. However, we may be able to track a proxy metric: the proportion of emails that were moved to the spam folder. If this rate changes significantly over time, it might indicate concept drift.
If such proxy metrics are not available either, we can base the detection framework on data distribution-based metrics, which we introduce in the next section.
Data Distribution-Based Methods
Methods in this category quantify how dissimilar the data distributions of the reference data \(X_{[t_0,t_1]}\) and the new data \(X_{[t_2,t_3]}\) are, without requiring ground truth labels.
How can the dissimilarity between two distributions be quantified? In the next subsections, we will introduce some popular univariate and multivariate metrics.
Univariate Metrics
Let’s start with a very simple univariate approach:
First, calculate the means of the \(i\)-th feature in the reference and new data. Then, define the difference of means as the dissimilarity measure
\[dis_i = |mean_{i}^{[t_0,t_1]} - mean_{i}^{[t_2,t_3]}|.\]
Finally, signal drift if \(dis_i\) is unexpectedly large. We signal drift whenever we observe an unexpected change in a feature’s mean over time. Other similarly simple statistics include the minimum, maximum, quantiles, and the ratio of null values in a column. These are easy to calculate and are an excellent starting point for building drift detection systems.
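A minimal sketch of such a check for a single numerical feature might look as follows; the thresholds are illustrative and would need to be chosen with domain knowledge.

```python
import numpy as np


def simple_stat_drift(ref: np.ndarray, new: np.ndarray,
                      mean_threshold: float = 0.5,
                      q95_threshold: float = 1.0) -> bool:
    """Signal drift if the mean or the 95% quantile moved more than the thresholds."""
    mean_diff = abs(ref.mean() - new.mean())
    q95_diff = abs(np.quantile(ref, 0.95) - np.quantile(new, 0.95))
    return mean_diff > mean_threshold or q95_diff > q95_threshold
```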
However, these approaches can be overly simplistic. For instance, the mean misses changes in the tails of the distribution, as do other simple statistics. This is why we need slightly more involved data drift detection methods.

Another popular univariate method is the Kolmogorov-Smirnov (K-S) test. The K-S test examines the entire distribution of a single feature and calculates the cumulative distribution functions (CDFs) of \(X(i)_{[t_0,t_1]}\) and \(X(i)_{[t_2,t_3]}\). Then, the test statistic is calculated as the maximum difference between the two CDFs:
\[dis_i = \sup |CDF(X(i)_{[t_0,t_1]}) - CDF(X(i)_{[t_2,t_3]})|,\]
and can detect differences in both the mean and the tails of the distribution.
The null hypothesis is that all samples are drawn from the same distribution. Hence, if the p-value is lower than a predefined significance level \(\alpha\) (e.g., 0.05), we reject the null hypothesis and conclude drift. To determine the critical value for a given \(\alpha\), we need to consult a two-sample K-S table. Or, if the sample sizes \(n\) (number of reference samples) and \(m\) (number of new samples) are large, the critical value is calculated according to
\[cv_{\alpha} = c(\alpha)\sqrt{\frac{n+m}{n \cdot m}},\]
where \(c(\alpha)\) can be found on Wikipedia for common values of \(\alpha\).
The K-S test is widely used in drift detection and is relatively robust against extreme values. However, keep in mind that even small numbers of extreme outliers can disproportionately affect the dissimilarity measure and lead to false positive alarms.
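As a sketch, the two-sample K-S test is readily available in scipy as `ks_2samp`; here we also compute the large-sample critical value from the formula above, using c(0.05) ≈ 1.358 from the Wikipedia table.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x_ref = rng.normal(loc=0.0, scale=1.0, size=5_000)
x_new = rng.normal(loc=0.0, scale=1.3, size=5_000)  # same mean, heavier tails

# Two-sample K-S test: maximum distance between the empirical CDFs.
result = stats.ks_2samp(x_ref, x_new)

# Large-sample critical value, with c(alpha) ≈ 1.358 for alpha = 0.05.
n, m = len(x_ref), len(x_new)
critical_value = 1.358 * np.sqrt((n + m) / (n * m))

print(f"D = {result.statistic:.3f}, p = {result.pvalue:.4f}, cv = {critical_value:.3f}")
if result.pvalue < 0.05:  # equivalently: result.statistic > critical_value
    print("Drift signalled")
```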
Population Stability Index

An even less sensitive alternative (or complement) is the population stability index (PSI). Instead of using cumulative distribution functions, the PSI involves dividing the range of observations into bins and calculating frequencies for each bin, effectively generating histograms of the reference and new data. We compare the histograms, and if they appear to have changed unexpectedly, the system signals drift. Formally, the dissimilarity is calculated according to:
\[dis = \sum_{b \in B} \left(ratio(b^{new}) - ratio(b^{ref})\right)\ln\left(\frac{ratio(b^{new})}{ratio(b^{ref})}\right) = \sum_{b \in B} PSI_{b},\]
where \(ratio(b^{new})\) is the ratio of data points falling into bin \(b\) in the new dataset, \(ratio(b^{ref})\) is the ratio of data points falling into bin \(b\) in the reference dataset, and \(B\) is the set of all bins. The smaller the difference between \(ratio(b^{new})\) and \(ratio(b^{ref})\), the smaller the PSI. Hence, if a large PSI is observed, a drift detection system would signal drift. In practice, a threshold of 0.2 or 0.25 is often applied as a rule of thumb. That is, if the PSI > 0.25, the system signals drift.
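A sketch of the PSI calculation with numpy is shown below; bin edges are derived from the reference data, and a small epsilon guards against empty bins (a common practical workaround, not part of the formula above).

```python
import numpy as np


def population_stability_index(ref: np.ndarray, new: np.ndarray,
                               n_bins: int = 10, eps: float = 1e-4) -> float:
    """PSI between a reference and a new sample of one numerical feature."""
    # Bin edges are computed on the reference data and reused for the new data.
    edges = np.histogram_bin_edges(ref, bins=n_bins)
    ref_ratio = np.histogram(ref, bins=edges)[0] / len(ref) + eps
    new_ratio = np.histogram(new, bins=edges)[0] / len(new) + eps
    return float(np.sum((new_ratio - ref_ratio) * np.log(new_ratio / ref_ratio)))


# Rule of thumb: signal drift if population_stability_index(ref, new) > 0.25.
```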
Chi-Squared Test
Lastly, we introduce a univariate drift detection method that can be applied to categorical features. All previous methods only work with numerical features.
So, let \(X\) be a categorical feature with \(n\) categories. Calculating the chi-squared test statistic is somewhat similar to calculating the PSI from the previous section. Rather than calculating the histogram of a continuous feature, we now consider the (relative) counts \(count_i\) per category \(i\). With these counts, we define the dissimilarity as the (normalized) sum of squared count differences between the reference and new data:
\[dis = \sum_{i=1}^{n} \frac{(count_{i}^{new} - count_{i}^{ref})^{2}}{count_{i}^{ref}}.\]
Note that in practice you may have to resort to relative counts if the sizes of the new and reference datasets differ.
To decide whether an observed dissimilarity is significant (with some predefined \(\alpha\) value), a table of chi-squared values with \(n-1\) degrees of freedom is consulted, e.g., on Wikipedia.
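As a sketch, assuming both windows are pandas Series of category labels, the test can be run with scipy's `chisquare` by rescaling the reference frequencies to the size of the new sample:

```python
import pandas as pd
from scipy import stats


def chi_squared_drift(ref: pd.Series, new: pd.Series, alpha: float = 0.05) -> bool:
    """Signal drift if the category frequencies differ significantly."""
    categories = sorted(set(ref.unique()) | set(new.unique()))
    ref_counts = ref.value_counts().reindex(categories, fill_value=0)
    new_counts = new.value_counts().reindex(categories, fill_value=0)
    # Expected counts: reference frequencies rescaled to the new sample size.
    expected = ref_counts / ref_counts.sum() * new_counts.sum()
    # Note: categories unseen in the reference window (expected count of zero)
    # would need special handling before running the test.
    statistic, p_value = stats.chisquare(f_obs=new_counts, f_exp=expected)
    return p_value < alpha
```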
Multivariate Tests
In many cases, each feature’s distribution individually may not be affected by drift according to the univariate tests in the previous sections, but the overall joint distribution may still be affected. For instance, the correlation between two features may change while the histograms of both (and, hence, the univariate PSI) appear stable. Clearly, such changes in feature interactions can severely impact machine learning model performance and should be detected. Therefore, we introduce a multivariate test that can complement the univariate tests of the previous sections.
Reconstruction-Error Based Test

This approach relies on self-supervised autoencoders that can be trained without labels. Such models consist of an encoder and a decoder, where the encoder maps the data to a (typically low-dimensional) latent space and the decoder learns to reconstruct the original data from the latent representation. The learning objective is to minimize the reconstruction error, i.e., the difference between the original and reconstructed data.
How can such autoencoders be used for drift detection? First, we train the autoencoder on the reference dataset and store the mean reconstruction error. Then, using the same model, we calculate the reconstruction error on new data and use the difference as the dissimilarity metric:
\[dis = |error_{[t_0, t_1]} - error_{[t_2, t_3]}|.\]
Intuitively, if the new and reference data are similar, the trained model should have no problem reconstructing the new data. Hence, if the dissimilarity is larger than a predefined threshold, the system signals drift.
This approach can spot more subtle multivariate drift. Note that principal component analysis (PCA) can be interpreted as a special case of an autoencoder. NannyML demonstrates how PCA reconstructions can identify changes in feature correlations that univariate methods miss.
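Below is a sketch of this idea using scikit-learn’s PCA as a simple stand-in for an autoencoder; the number of components and the drift threshold are assumptions that would need tuning.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_reconstructor(X_ref: np.ndarray, n_components: int = 2):
    """Fit scaler + PCA on the reference data and return the pipeline."""
    model = make_pipeline(StandardScaler(), PCA(n_components=n_components))
    model.fit(X_ref)
    return model


def reconstruction_error(model, X: np.ndarray) -> float:
    """Mean squared reconstruction error in the scaled feature space."""
    scaled = model.named_steps["standardscaler"].transform(X)
    latent = model.named_steps["pca"].transform(scaled)
    reconstructed = model.named_steps["pca"].inverse_transform(latent)
    return float(np.mean((scaled - reconstructed) ** 2))


# Usage: fit on reference data, store its error, then compare on new data.
# dis = abs(reconstruction_error(model, X_new) - reference_error)
# Signal drift if dis exceeds a predefined threshold.
```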
Summary of Popular Drift Detection Methods
To conclude this section, we would like to summarize the drift detection methods in the following table:
| Name | Applied to | Test statistic | Drift if | Notes |
| --- | --- | --- | --- | --- |
| Statistical and threshold-based tests | Univariate, numerical data | Differences in simple statistics like mean, quantiles, counts, etc. | The difference is larger than a predefined threshold | May miss differences in the tails of distributions; setting the threshold requires domain knowledge or gut feeling |
| Kolmogorov-Smirnov (K-S) | Univariate, numerical data | Maximum difference in the cumulative distribution functions of reference and new data | p-value is small (e.g., p < 0.05) | Can be sensitive to outliers |
| Population Stability Index (PSI) | Univariate, numerical data | Differences in the histograms of reference and new data | PSI is larger than the predefined threshold (e.g., PSI > 0.25) | Choosing a threshold is often based on gut feeling |
| Chi-Squared Test | Univariate, categorical data | Differences in counts of observations per category in reference and new data | p-value is small (e.g., p < 0.05) | |
| Reconstruction-Error Test | Multivariate, numerical data | Difference in mean reconstruction error between reference and new data | The difference is larger than the predefined threshold | Defining a threshold can be hard; the method may be relatively complex to implement and maintain |
What to Do Against Drift
Even though the focus of this article is the detection of drift, we would also like to give an idea of what can be done against it.
As a general rule, it is important to automate drift detection and mitigation as much as possible and to define clear responsibilities to ensure ML systems remain relevant.
First Line of Defense: Robust Modeling Techniques
The first line of defense is applied even before the model is deployed. Training data and model engineering decisions directly impact sensitivity to drift, and model developers should be aware of robust modeling techniques, or robust machine learning. For instance, a machine learning model relying on many features may be more susceptible to the effects of drift. Naturally, more features mean a larger “attack surface”, and some features may be more sensitive to drift than others (e.g., sensor measurements are subject to noise, whereas sociodemographic data may be more stable). Investing in robust feature selection is likely to pay off in the long run.
Moreover, including noisy or malicious data in the training dataset may make models more robust against smaller distributional changes. The field of adversarial machine learning is concerned with teaching ML models to deal with adversarial inputs.
Second Line of Defense: Define a Fallback Strategy
Even the most carefully engineered model will likely experience drift at some point. When this happens, make sure to have a backup plan ready. To prepare such a plan, the consequences of failure must first be understood. Recommending the wrong pair of shoes in an email newsletter has very different implications from misclassifying objects in an autonomous driving system. In the first case, it may be acceptable to wait for human feedback before sending the email if drift is detected. In the latter case, a much more immediate response is required. For instance, a rule-based system or any other system not affected by drift may take over.
Striking Back: Model Updates
After addressing the immediate effects of drift, you can work to restore the model’s performance. The most obvious activity is retraining the model or updating model weights with the latest data. One of the challenges of retraining is defining a new training dataset. Should it include all available data? In the case of concept drift, this may harm convergence since the dataset may contain inconsistent training samples. If the dataset is too small, this may lead to catastrophic forgetting of previously learned patterns since the model may not be exposed to enough training samples.
To prevent catastrophic forgetting, methods from continual and active learning can be applied, e.g., by introducing memory systems.
It is crucial to weigh the different options, be aware of the trade-offs, and make a decision based on the impact on the use case.
Conclusion
In this article, we describe why drift detection is essential for anyone who cares about the long-term success and robustness of machine learning systems. If drift occurs and is not taken care of, model performance will degrade, potentially harming revenue, eroding trust and reputation, or even having legal consequences.
We formally introduce concept and data drift as unexpected differences between training and inference data. Such unexpected changes can be detected by applying univariate tests such as the Kolmogorov-Smirnov test, the Population Stability Index, and the chi-squared test, or multivariate tests such as reconstruction-error-based tests. Lastly, we briefly touch upon a few strategies for dealing with drift.
In the future, we plan to follow up with a hands-on guide building on the concepts introduced in this article. Finally, one last note: while the article introduces several increasingly complex methods and ideas, keep in mind that any drift detection is always better than no drift detection. Depending on the use case, a very simple detection system can prove very effective.
- https://en.wikipedia.org/wiki/Catastrophic_interference
- J. Lu, A. Liu, F. Dong, F. Gu, J. Gama and G. Zhang, “Learning under Concept Drift: A Review,” in IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 12, pp. 2346-2363, Dec. 2019
- M. Stallmann, A. Wilbik and G. Weiss, “Towards Unsupervised Sudden Data Drift Detection in Federated Learning with Fuzzy Clustering,” 2024 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/FUZZ-IEEE60900.2024.10611883
- https://www.evidentlyai.com/ml-in-production/concept-drift
- https://www.evidentlyai.com/ml-in-production/data-drift
- https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
- https://stats.stackexchange.com/questions/471732/intuitive-explanation-of-kolmogorov-smirnov-test
- Yurdakul, Bilal, “Statistical Properties of Population Stability Index” (2018). Dissertations. 3208. https://scholarworks.wmich.edu/dissertations/3208
- https://en.wikipedia.org/wiki/Chi-squared_test
- https://www.nannyml.com/blog/hypothesis-testing-for-ml-performance#chi-2-test
- https://nannyml.readthedocs.io/en/important/how_it_works/multivariate_drift.html#how-multiv-drift
- https://en.wikipedia.org/wiki/Autoencoder
