
Building a large-scale unsupervised model anomaly detection system — Part 2


By Rajeev Prabhakar, Han Wang, Anindya Saha

A camera lens looking at a city downtown
Photo by Octavian Rosca on Unsplash

In our previous blog, we discussed the various challenges we faced with model monitoring and our strategy for addressing a few of them. We briefly mentioned using z-scores to detect anomalies. In this post, we will dive deeper into anomaly detection and building a culture of observability.

Model observability is often neglected but is critical in the machine learning model lifecycle. Developing a good observability strategy helps to narrow down problems quickly at their roots and take appropriate actions such as retraining the model, improving feature selection, and troubleshooting feature drift.

The example below shows what our finished product looks like. The highlighted regions are the timeframes where anomalies were detected. With a dashboard that contains the corresponding features, it becomes quick to diagnose the root cause of the anomaly.

Utilizing Data Profiling

In our part-1 blog, we talked about the importance of data profiling.

Although it is common practice to monitor anomalies based on specific aggregated metrics over raw data, the question remains: which metrics are helpful? For outliers, the minimum, maximum, and 99th percentile are very useful. For numerical distribution drift, the mean and median are effective. For categorical data, frequent items help detect categorical drift, and cardinality speaks to overall data quality. Clearly, a variety of metrics is needed, and these requirements can also change over time. Recomputing metrics for large datasets can be very slow and cost prohibitive.

That is the main reason to leverage data profiling. We chose whylogs because we can build profiles of various functional, integral, and distribution metrics in a single pass. Another reason we chose whylogs is its low latency and its ability to fit into a MapReduce framework.
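As an illustration, here is a minimal sketch of profiling one hourly batch with whylogs, assuming the raw data is already loaded into a Pandas DataFrame (the file name and selected columns are just examples):

```python
import pandas as pd
import whylogs as why

# Hypothetical hourly batch of raw features and predictions (file name is an example).
raw_df = pd.read_parquet("hourly_batch.parquet")

# Profile the batch in a single pass; the profile captures counts,
# distribution sketches, and frequent items for every column.
profile_view = why.log(raw_df).view()

# Flatten the profile into a metrics table (one row per column) so it can be
# stored and later assembled into a time series of profiles.
metrics_df = profile_view.to_pandas()
print(metrics_df[["counts/n", "distribution/mean", "distribution/max"]])
```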

Anomaly Detection Design Principles

Below are the main aspects we considered when building an anomaly detection solution. Because we already created a time series of profiles in the previous step, the solution focuses on leveraging forecasted confidence intervals to find anomalies.

To ensure adoption across the broader organization, the solution must be general and flexible enough to plug into most, if not all, business use cases. Anomaly detection was historically lacking because of the amount of domain-specific logic each implementation required. The anomaly detection solution aims to be something general-purpose that can serve as the first line of defense. For more critical applications, more detailed business rules can be applied on top of the general solution.

Striking a balance between accuracy and speed is crucial for time-series forecasting. To make sure we were using the best tool for the job, we evaluated several popular forecasting libraries, including Facebook Prophet, LinkedIn Greykite, and Nixtla StatsForecast. After careful consideration, we decided to adopt StatsForecast for time-series anomaly detection because of its exceptional performance.

Regarding accuracy, StatsForecast provides a wide selection of statistical and econometric models for forecasting univariate time series. With this package, we can easily choose among models like ARIMA, MSTL, ETS, and exponential smoothing, which are neatly wrapped behind the same caller interface, enabling us to evaluate and generate forecasts from multiple models concurrently with just a few lines of Python code. The models implemented in StatsForecast were written from scratch and have shown excellent performance in recent forecasting competitions. StatsForecast also publishes several experiments benchmarking the performance of its models.

In terms of speed, StatsForecast really stands out. Its models run impressively fast, thanks to effective use of Numba and parallel computing. This means we can generate forecasts quickly and efficiently without compromising on accuracy. With its combination of accuracy and speed, StatsForecast is the right tool for our time-series anomaly detection needs.

A general solution should work on both small and large datasets. The problem is that this could mean maintaining a Pandas solution and a Spark solution. Fugue, an open-source abstraction layer for distributed computing that brings Python and Pandas code to Spark, allows users to define their logic with the local packages they are comfortable with and scale it with minimal wrappers. Abstracting away the execution engine lets us focus on defining the logic only once.
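A minimal sketch of that pattern with Fugue's transform function, assuming a hypothetical profiles_df keyed by model_id and an existing SparkSession named spark (the placeholder anomaly rule is illustrative, not our production logic):

```python
import pandas as pd
from fugue import transform

# Anomaly logic written once against plain Pandas. `profile_ts` holds the
# profile-metric time series for a single model (one partition).
def flag_anomalies(profile_ts: pd.DataFrame) -> pd.DataFrame:
    # Placeholder rule for illustration; in practice the forecasting-based
    # check described later plugs in here.
    mean, std = profile_ts["value"].mean(), profile_ts["value"].std()
    profile_ts["is_anomaly"] = (profile_ts["value"] - mean).abs() > 3 * std
    return profile_ts

# Run locally on Pandas for small datasets...
local_out = transform(
    profiles_df, flag_anomalies,
    schema="*, is_anomaly:bool",
    partition={"by": "model_id"},
)

# ...or distribute the exact same function with Spark by swapping the engine.
spark_out = transform(
    profiles_df, flag_anomalies,
    schema="*, is_anomaly:bool",
    partition={"by": "model_id"},
    engine=spark,  # an existing SparkSession
)
```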

Identifying Potential Anomalies

For time-series anomalies, although the z-score based approach is effective and SQL friendly, it generates too many false positives unless the thresholds are tuned for each scenario. We therefore also employ a more sophisticated machine learning approach that achieves a low false-positive rate without per-scenario adjustment.

With StatsForecast's models, we have plenty of options to choose from based on the characteristics of the data. StatsForecast also lets a user run multiple models at once on the same time series without any noticeable additional runtime. This offers great flexibility for teams that are early in their anomaly detection journey. Below is an example selecting AutoARIMA as the model to generate forecasts.

AutoARIMA forecasts using Statsforecast
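A minimal sketch of what that code might look like, assuming the profile metric has already been reshaped into the long format StatsForecast expects (unique_id, ds, y) and that hourly data with daily seasonality is a reasonable default:

```python
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

# Hypothetical long-format frame of the hourly "distribution/mean" metric with
# the columns StatsForecast expects: unique_id, ds (timestamp), y (value).
profile_ts = pd.read_parquet("prediction_profile_mean.parquet")

sf = StatsForecast(
    models=[AutoARIMA(season_length=24)],  # hourly data with daily seasonality
    freq="H",
    n_jobs=-1,
)

# Forecast the next 48 hours with a 95% interval, keeping in-sample (fitted)
# values so historical anomalies can be analyzed afterwards.
forecasts = sf.forecast(df=profile_ts, h=48, level=[95], fitted=True)
```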

In addition to obtaining forecasts, we can access the in-sample prediction values of each model. We use these to analyze anomalies in historical data.

An anomaly is any data point that falls outside the confidence interval. In the graph below, we can see the anomalies found in the model's predictions based on the "distribution/mean" metric. Only the times where anomalies were reported are highlighted.

Anomalies highlighted on model predictions
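Continuing the sketch above, flagging those historical anomalies amounts to comparing each observed value with the in-sample confidence bounds; the column names follow StatsForecast's usual pattern for a 95% level and are shown here for illustration:

```python
# In-sample predictions with confidence bounds (available because fitted=True above).
insample = sf.forecast_fitted_values()
# Typical columns: unique_id, ds, y, AutoARIMA, AutoARIMA-lo-95, AutoARIMA-hi-95.

# An hour is flagged as anomalous when the observed value leaves the 95% interval.
outside = (insample["y"] < insample["AutoARIMA-lo-95"]) | (
    insample["y"] > insample["AutoARIMA-hi-95"]
)
anomalous_hours = insample.loc[outside, ["unique_id", "ds", "y"]]
```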

Exploring Anomaly Root Cause

Anomalies by themselves seldom offer actionable insights. For an ML model, explaining a prediction anomaly in terms of input feature drift can provide valuable context for understanding it.

Now that we have found the anomalies, let's try to understand whether we can identify their cause. The graph above does not give any insight into why an anomaly occurred. In this section, we will investigate how prediction anomalies relate to feature drift.

In a stable machine learning system, the aggregation metrics of features usually remain predictable (with clear seasonality and trend) over time. Below, let's examine how feature drift over time impacts the model's predictions.

Just like the profile of prediction values, we have all of the profile metrics available for each feature. Below is an example of a time series generated from the "distribution/mean" values of the feature profiles.

Distribution/Mean of features of the model

Since we are trying to identify feature drift that corresponds to a change in prediction values, we train a regressor on the changes in features over consecutive time periods against the changes in predictions over the same intervals. We then use Shapley values to explain the model.
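A minimal sketch of that idea, where feature_means and prediction_means are hypothetical hourly "distribution/mean" series assembled from the profiles, and the gradient-boosted regressor is an illustrative choice rather than a prescribed one:

```python
import shap
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical inputs: `feature_means` is a wide frame of hourly
# "distribution/mean" values (one column per feature) and `prediction_means`
# is the matching series for the model's predictions.
feature_drift = feature_means.diff().dropna()        # hour-over-hour change per feature
prediction_drift = prediction_means.diff().dropna()  # hour-over-hour change in predictions

# Regress prediction drift on feature drift; the regressor choice is illustrative.
reg = GradientBoostingRegressor().fit(feature_drift, prediction_drift)

# Shapley values rank how much each feature's drift explains the prediction change.
explainer = shap.TreeExplainer(reg)
shap_values = explainer.shap_values(feature_drift)
shap.summary_plot(shap_values, feature_drift, plot_type="bar")
```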

Importance of feature drift relative to the prediction

The bar chart above shows the stack ranking of feature-drift importance with respect to the change in prediction value. We can see that a change in request_latency had a profound impact on the model's prediction value. Using this information, plotting the identified features enables us to examine them.

Analyzing the features

The highlighted red lines (anomalous hours from the prediction data) coincide with a spike and drop in request latency, which caused the anomalies in the model's predictions.

Monitoring and Alerting for Anomalies

It is important to have an effective communication channel for the discovered anomalies. We built a simple dashboard using Mode Analytics. Although these dashboards provide good insights into the model, the benefit is only realized when timely action is taken on detected anomalies.

To handle this, we send soft alerts through Slack instead of integrating with services like PagerDuty. Each model has a model owner and team with a Slack handle. Upon detection of any anomaly, the corresponding Slack channel gets notified.
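A minimal sketch of such a notification, assuming Slack incoming webhooks and a hypothetical SLACK_WEBHOOK_URLS mapping from model name to the owning team's webhook URL:

```python
import requests

# Hypothetical mapping from model name to the owning team's Slack incoming-webhook URL.
SLACK_WEBHOOK_URLS = {"example_model": "https://hooks.slack.com/services/..."}

def notify_anomaly(model_name: str, metric: str, window: str) -> None:
    # Post a soft alert to the team's channel via its incoming webhook.
    message = {
        "text": f":warning: Anomaly detected for `{model_name}` "
                f"on `{metric}` during {window}."
    }
    requests.post(SLACK_WEBHOOK_URLS[model_name], json=message, timeout=10)
```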

Note that the identified anomalies are based on the past performance of the metric. While we can select a good confidence interval to minimize false positives, it is inevitable that some will occur. In our experience, using systems like PagerDuty can lead to alert fatigue for teams and result in a loss of interest or trust in the system generating the alerts.

Applications

In this section, we discuss the various scenarios where the anomaly detection system has been used.

Taking Timely Action on ML Models

Each model is automatically onboarded onto the anomaly detection system, since we rely on system logs rather than user setup. This has drastically reduced the turnaround time for acting on broken models, without requiring any explicit user action to set up anomaly detection.

Prediction anomaly detection for ML models relies on hourly or daily profiles. Generalizing this idea, these are simply time series data. Building out the anomaly detection platform allowed us to plug in any business metric with a defined time interval. Our Operations team uses anomaly detection on some of the most important business metrics to get timely alerts and to look at historical trends for forecasting corrections.

We are also experimenting with using the StatsForecast package to generate forecasts for future time horizons (such as the next two days) and then comparing real-time values against the forecasted values to determine whether a real-time value is anomalous. Real-time values are considered anomalous if they fall outside the confidence bounds of the forecasts, and we notify users in real time when such deviations occur. This allows us to catch anomalous predictions within a few minutes.
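Under the same assumptions as the earlier sketches, such a real-time check might look like the following, where realtime_df is a hypothetical frame of freshly observed values joined against the stored forecasts:

```python
# Hypothetical frame of freshly observed values (unique_id, ds, y) compared
# against the forecasts produced in the earlier sketch.
latest = realtime_df.merge(forecasts, on=["unique_id", "ds"], how="inner")

outside_bounds = (latest["y"] < latest["AutoARIMA-lo-95"]) | (
    latest["y"] > latest["AutoARIMA-hi-95"]
)

# Notify the owning team (via the webhook helper above) for each deviation.
for _, row in latest[outside_bounds].iterrows():
    notify_anomaly(row["unique_id"], "prediction distribution/mean", str(row["ds"]))
```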

Acknowledgments

Special thanks to Shiraz Zaman and Mihir Mathur for the Engineering and Product Management support behind this work.
