Evaluating Model Retraining Strategies


How data drift and concept drift inform the choice of the right retraining strategy

(created with Image Creator in Bing)

Many people in the field of MLOps have probably heard a story like this:

Company A embarked on an ambitious quest to harness the power of machine learning. It was a journey fraught with challenges, as the team struggled to pinpoint a topic that would not only leverage the power of machine learning but also deliver tangible business value. After many brainstorming sessions, they finally settled on a use case that promised to revolutionize their operations. With excitement, they contracted Company B, a reputed expert, to build and deploy an ML model. Following months of rigorous development and testing, the model passed all acceptance criteria, marking a big milestone for Company A, who looked forward to future opportunities.

However, as time passed, the model began producing unexpected results, rendering it ineffective for its intended use. Company A reached out to Company B for advice, only to learn that the changed circumstances required building a brand-new model, necessitating an even higher investment than the original.

What went wrong? Was the model Company B created not as good as expected? Was Company A just unlucky that something unexpected happened?

Probably the issue was that even the most rigorous testing of a model before deployment doesn't guarantee that the model will perform well for an unlimited period of time. The two most important factors that impact a model's performance over time are data drift and concept drift.

Data Drift: Also known as covariate shift, this occurs when the statistical properties of the input data change over time. If an ML model was trained on data from a particular demographic but the demographic characteristics of the input data change, the model's performance can degrade. Imagine you taught a child the multiplication tables up to 10. It can quickly give you the right answers for 3 * 7 or 4 * 9. However, if you then ask what 4 * 13 is, it may give you the wrong answer, even though the rules of multiplication didn't change, because it never memorized that result.

Concept Drift: This occurs when the relationship between the input data and the target variable changes. This can lead to a degradation in model performance because the model's predictions no longer align with the evolving data patterns. An example here could be spelling reforms. When you were a child, you may have learned to write "co-operate", but now it is written as "cooperate". Although you mean the same word, your output of writing that word has changed over time.

In this article I investigate how different scenarios of data drift and concept drift impact a model's performance over time. Moreover, I show which retraining strategies can mitigate performance degradation.

I focus on evaluating retraining strategies with respect to the model's prediction performance. In practice, additional factors such as:

  • Data Availability and Quality: Ensure that sufficient, high-quality data is available for retraining the model.
  • Computational Costs: Evaluate the computational resources required for retraining, including hardware and processing time.
  • Business Impact: Consider the potential impact on business operations and outcomes when selecting a retraining strategy.
  • Regulatory Compliance: Ensure that the retraining strategy complies with any relevant regulations and standards, e.g. anti-discrimination.

need to be considered to identify a suitable retraining strategy.

(created with Image Creator in Bing)

To highlight the differences between data drift and concept drift, I synthesized datasets in which I controlled the extent to which these effects appear.

I generated datasets in 100 steps, modifying parameters incrementally to simulate the evolution of the dataset. Each step contains multiple data points and can be interpreted as the amount of data collected over an hour, a day, or a week. After every step the model was re-evaluated and could be retrained.

To create the datasets, I first randomly sampled features from a normal distribution whose mean µ and standard deviation σ depend on the step number s, sketched here as:
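xi ~ N(µi(s), σi(s))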

The data drift of feature xi depends on how much µi and σi change with respect to the step number s.

All features are aggregated as follows, sketched here in simplified notation:
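X = Σi ci · xi + ε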

Here, ci are coefficients that describe the impact of feature xi on X. Concept drift can be controlled by changing these coefficients with respect to s. A random number ε, which is not available for model training, is added to account for the fact that the features don't contain complete information to predict the target y.

The target variable y is calculated by feeding X into a non-linear function. This creates a harder task for the ML model, since there is no linear relation between the features and the target. For the scenarios in this article, I chose a sine function.
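Putting these pieces together, here is a minimal sketch of how one step of such a dataset could be generated. The sample counts, drift rates, and function names are illustrative placeholders rather than the exact values used in my experiments:

```python
import numpy as np

def generate_step(step, n_samples=200, n_features=3, seed=None):
    """Generate the data of one step, with illustrative data drift and concept drift."""
    rng = np.random.default_rng(seed)

    # Data drift: mean and standard deviation of each feature depend on the step number s.
    mu = 0.05 * step * np.arange(1, n_features + 1)
    sigma = 1.0 + 0.01 * step * np.ones(n_features)
    features = rng.normal(loc=mu, scale=sigma, size=(n_samples, n_features))

    # Concept drift: the coefficients ci depend on the step number s.
    coefficients = np.linspace(1.0, -1.0, n_features) + 0.02 * step

    # Noise term epsilon that is not available for model training.
    epsilon = rng.normal(scale=0.1, size=n_samples)

    # Aggregate the features and apply a non-linear (sine) function to obtain the target.
    aggregated = features @ coefficients + epsilon
    target = np.sin(aggregated)
    return features, target

# Generate a dataset of 100 steps, one (features, target) tuple per step.
steps = [generate_step(s, seed=s) for s in range(100)]
```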

(created with Image Creator in Bing)

I created the following scenarios to investigate (a sketch of the corresponding parameter schedules follows the list):

  • Steady State: simulating no data or concept drift — parameters µ, σ, and c were independent of the step s
  • Distribution Drift: simulating data drift — parameters µ and σ were linear functions of s, parameter c was independent of s
  • Coefficient Drift: simulating concept drift — parameters µ and σ were independent of s, parameter c was a linear function of s
  • Black Swan: simulating an unexpected and sudden change — parameters µ, σ, and c were independent of the step s except for one step, when these parameters were changed
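As an illustration, these scenarios could be encoded as parameter schedules like the sketch below. The concrete numbers are placeholders, not the values used in my experiments, and the sketch assumes the black swan change persists from the affected step onward:

```python
def scenario_parameters(scenario, step, black_swan_step=39):
    """Return illustrative (mu_shift, sigma_scale, coefficient_shift) values per scenario."""
    if scenario == "steady_state":
        # No drift: all parameters are independent of the step number.
        return 0.0, 1.0, 0.0
    if scenario == "distribution_drift":
        # Data drift: mu and sigma are linear functions of the step number.
        return 0.05 * step, 1.0 + 0.01 * step, 0.0
    if scenario == "coefficient_drift":
        # Concept drift: the coefficients are a linear function of the step number.
        return 0.0, 1.0, 0.02 * step
    if scenario == "black_swan":
        # Constant parameters, except for a sudden change at the black swan step.
        if step >= black_swan_step:
            return 1.0, 1.5, 0.5
        return 0.0, 1.0, 0.0
    raise ValueError(f"Unknown scenario: {scenario}")
```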

The COVID-19 pandemic serves as a quintessential example of a Black Swan event. A Black Swan is characterized by its extreme rarity and unexpectedness. COVID-19 could not have been predicted beforehand, so its effects could not be mitigated in advance. Many deployed ML models suddenly produced unexpected results and had to be retrained after the outbreak.

For each scenario I used the first 20 steps as the training data for the initial model. For the remaining steps I evaluated three retraining strategies (a sketch of the corresponding training-data selection follows the list):

  • None: No retraining — the model trained on the initial training data was used for all remaining steps.
  • All Data: All previous data was used to train a new model, e.g. the model evaluated at step 30 was trained on the data from steps 0 to 29.
  • Window: A fixed window size was used to select the training data, e.g. for a window size of 10 the training data at step 30 contained steps 20 to 29.
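A minimal sketch of the training-data selection for the three strategies could look like this (function and parameter names are illustrative):

```python
def select_training_steps(strategy, current_step, initial_train_steps=20, window_size=10):
    """Return the step indices whose data trains the model evaluated at current_step."""
    if strategy == "none":
        # No retraining: only the initial training data is ever used.
        return list(range(initial_train_steps))
    if strategy == "all_data":
        # All data collected before the current step.
        return list(range(current_step))
    if strategy == "window":
        # A fixed-size window of the most recent steps.
        return list(range(max(0, current_step - window_size), current_step))
    raise ValueError(f"Unknown strategy: {strategy}")

# Example from the text: with a window size of 10, the model evaluated
# at step 30 is trained on the data of steps 20 to 29.
assert select_training_steps("window", 30) == list(range(20, 30))
```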

I used an XGBoost regression model and mean squared error (MSE) as the evaluation metric.
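Tying this together, a sketch of the per-step evaluation loop could look as follows, assuming the data-generation and selection helpers sketched above; the XGBoost hyperparameters are left at their defaults here, which is a simplification:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

def evaluate_strategy(strategy, steps, initial_train_steps=20, window_size=10):
    """Return the MSE per step for one retraining strategy.

    `steps` is a list of (features, target) tuples, one entry per step,
    for example produced by the data-generation sketch above.
    """
    errors = {}
    model = None
    for current_step in range(initial_train_steps, len(steps)):
        # Retrain at every step, except for the "none" strategy which trains only once.
        if strategy != "none" or model is None:
            train_steps = select_training_steps(
                strategy, current_step, initial_train_steps, window_size
            )
            X_train = np.vstack([steps[i][0] for i in train_steps])
            y_train = np.concatenate([steps[i][1] for i in train_steps])
            model = XGBRegressor().fit(X_train, y_train)
        X_eval, y_eval = steps[current_step]
        errors[current_step] = mean_squared_error(y_eval, model.predict(X_eval))
    return errors
```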

Steady State

Prediction error of steady state scenario

The diagram above shows the evaluation results of the steady state scenario. As the first 20 steps were used to train the models, the evaluation error there was much lower than at later steps. The performance of the None and Window retraining strategies remained at a similar level throughout the scenario. The All Data strategy slightly reduced the prediction error at higher step numbers.

In this case All Data is the best strategy, since it profits from an increasing amount of training data while the models of the other strategies were trained on a constant amount of training data.

Distribution Drift (Data Drift)

Prediction error of distribution drift scenario

When the input data distributions changed, we can clearly see that the prediction error continuously increased if the model was not retrained on the latest data. Retraining on all data or on a data window resulted in very similar performance. The reason for this is that although All Data used more data, older data was not relevant for predicting the most recent data.

Coefficient Drift (Concept Drift)

Prediction error of coefficient drift scenario

Changing coefficients means that the importance of features changes over time. In this case we can see that the None retraining strategy had a drastic increase in prediction error. Moreover, the results showed that retraining on all data also led to a continuous increase in prediction error, while the Window retraining strategy kept the prediction error at a constant level.

The reason why the All Data strategy's performance also decreased over time was that the training data contained more and more cases where similar inputs resulted in different outputs. Hence, it became harder for the model to identify clear patterns and derive decision rules. This was less of a problem for the Window strategy, since older data was ignored, which allowed the model to "forget" older patterns and focus on the most recent cases.

Black Swan

Prediction error of black swan event scenario

The black swan event occurred at step 39, and the errors of all models suddenly increased at this point. However, after retraining a new model on the latest data, the errors of the All Data and Window strategies recovered to the previous level. This is not the case with the None retraining strategy, where the error increased around 3-fold compared to before the black swan event and remained at that level until the end of the scenario.

In contrast to the previous scenarios, the black swan event contained both data drift and concept drift. It is remarkable that the All Data and Window strategies recovered in the same way after the black swan event, while we found a significant difference between these strategies in the concept drift scenario. Probably the reason for this is that the data drift occurred simultaneously with the concept drift. Hence, patterns learned on older data were no longer relevant after the black swan event because the input data had shifted as well.

An example of this could be that you are a translator and you get requests to translate a language you haven't translated before (data drift). At the same time there was a comprehensive spelling reform of this language (concept drift). While translators who have translated this language for many years may struggle to apply the reform, it wouldn't affect you, since you didn't even know the rules before the reform.

To reproduce this analysis or explore further, you can check out my git repository.

Identifying, quantifying, and mitigating the impact of data drift and concept drift is a challenging topic. In this article I analyzed simple scenarios to present the basic characteristics of these concepts. More comprehensive analyses will undoubtedly provide deeper and more detailed conclusions on this topic.

Here’s what I learned from this project:

Mitigating concept drift is harder than mitigating data drift. While data drift can be handled by basic retraining strategies, concept drift requires a more careful selection of training data. Paradoxically, cases where data drift and concept drift occur at the same time may be easier to handle than pure concept drift cases.

A comprehensive evaluation of the training data is the best starting point for finding a suitable retraining strategy. It is essential to partition the training data with respect to the time when it was recorded. To make the most realistic assessment of the model's performance, the most recent data should be used only as test data. For an initial assessment regarding data drift and concept drift, the remaining training data can be split into two equally sized sets, with the older data in one set and the newer data in the other. Comparing the feature distributions of these sets allows you to assess data drift. Training one model on each set and comparing the change in feature importance allows an initial assessment of concept drift.
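A sketch of this initial assessment could look like the snippet below, assuming the data is available as a pandas DataFrame with a timestamp column and that the most recent data has already been set aside as test data; column and function names are placeholders:

```python
from scipy.stats import ks_2samp
from xgboost import XGBRegressor

def initial_drift_assessment(df, feature_cols, target_col, time_col):
    """Split time-ordered data into an older and a newer half and compare
    feature distributions (data drift) and feature importances (concept drift)."""
    df = df.sort_values(time_col)
    half = len(df) // 2
    older, newer = df.iloc[:half], df.iloc[half:]

    # Data drift: compare each feature's distribution between the two halves.
    drift_p_values = {
        col: ks_2samp(older[col], newer[col]).pvalue for col in feature_cols
    }

    # Concept drift: train one model per half and compare the feature importances.
    importances = {}
    for name, part in [("older", older), ("newer", newer)]:
        model = XGBRegressor().fit(part[feature_cols], part[target_col])
        importances[name] = dict(zip(feature_cols, model.feature_importances_))

    return drift_p_values, importances
```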

No retraining turned out to be the worst option in all scenarios. Moreover, in cases where model retraining is not considered, it is also more likely that data to evaluate and/or retrain the model is not collected in an automated way. This means that model performance degradation may go unrecognized or only be noticed at a late stage. Once developers become aware that there is a potential issue with the model, precious time is lost until new data is collected that can be used to retrain the model.

Identifying the right retraining strategy at an early stage is very difficult and may even be impossible if there are unexpected changes in the serving data. Hence, I think a reasonable approach is to start with a retraining strategy that performed well on the partitioned training data. This strategy should be reviewed and updated whenever cases occur where it did not address changes in an optimal way. Continuous model monitoring is essential to quickly notice and react when model performance decreases.
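As an illustration of such a monitoring check (the threshold, factor, and names are placeholders, not a recommendation):

```python
def should_retrain(recent_errors, baseline_error, factor=1.5, min_points=5):
    """Flag retraining when the average recent error exceeds the baseline by a given factor."""
    if len(recent_errors) < min_points:
        return False
    recent_average = sum(recent_errors) / len(recent_errors)
    return recent_average > factor * baseline_error

# Example: the test-set MSE at deployment was 0.02, and the latest evaluations drifted upwards.
print(should_retrain([0.05, 0.06, 0.04, 0.07, 0.05], baseline_error=0.02))  # True
```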

Unless otherwise stated, all images were created by the author.
