Warden: Real Time Anomaly Detection at Pinterest

Pinterest Engineering Blog

Isabel Tallam | Sw Eng, Real Time Analytics; Charles Wu | Sw Eng, Real Time Analytics; Kapil Bajaj | Eng Manager, Real Time Analytics


Detecting anomalous events has become increasingly important at Pinterest. Anomalous events, broadly defined, are rare occurrences that deviate from normal or expected behavior. Because these events can be found almost anywhere, the opportunities and applications for anomaly detection are vast. At Pinterest, we have explored leveraging anomaly detection, specifically our Warden Anomaly Detection Platform, for several use cases (which we'll get into in this post). With the positive results we are seeing, we plan to continue expanding our anomaly detection work and use cases.

In this blog post, we'll walk through:

  1. The general architecture and design philosophy of the platform.
  2. How we added functionality to Warden to review ML model scores, which allows us to analyze any drift in the models.
  3. How we detect and remove spam and the users who create it, which is a priority for keeping our systems safe and providing a great experience for our users.

What's Warden?

Warden is the anomaly detection platform created at Pinterest. The key design principle for Warden is modularity: building the platform in a modular way so that we can easily make changes.

Why? Early in our research, it quickly became clear that there are many approaches to detecting anomalies, depending on the type of data and how anomalies are defined for that data. Different approaches and algorithms are needed to accommodate those differences. With this in mind, we built three different modules, which we still use today (a minimal sketch of how they fit together follows the list):

  • Query input data: retrieves the data to be analyzed from the data source
  • Applying anomaly algorithm: analyzes the data and identifies any outliers
  • Notification: returns results or alerts so that consuming systems can trigger next steps
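
To make the modular design concrete, below is a minimal sketch of how the three modules could be wired together. This is an illustration only, not Warden's actual code: the class and function names (`DataSource`, `AnomalyAlgorithm`, `Notifier`, `run_detection`) are assumptions for the example.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class DataSource(ABC):
    """Query input data: retrieves the data to be analyzed."""

    @abstractmethod
    def query(self) -> Sequence[float]:
        ...


class AnomalyAlgorithm(ABC):
    """Applying anomaly algorithm: analyzes the data and flags outliers."""

    @abstractmethod
    def detect(self, data: Sequence[float]) -> list[int]:
        """Return indexes of data points considered anomalous."""
        ...


class Notifier(ABC):
    """Notification: returns results or alerts to consuming systems."""

    @abstractmethod
    def notify(self, anomalies: list[int], data: Sequence[float]) -> None:
        ...


def run_detection(source: DataSource, algorithm: AnomalyAlgorithm, notifier: Notifier) -> None:
    """One detection cycle: query -> detect -> notify.

    Because each stage is an interface, a new data source or algorithm
    can be plugged in without touching the rest of the pipeline.
    """
    data = source.query()
    anomalies = algorithm.detect(data)
    if anomalies:
        notifier.notify(anomalies, data)
```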

This modular approach has enabled us to easily adjust for new data types and plug in new algorithms when needed. In the sections below we'll review two of our main use cases: ML Model Drift and Spam Detection.

Detecting Real Time ML Model Drift

The first use case is our ML Monitoring project. This section provides details on why we initiated the project, which technologies and algorithms we used, and how we solved some of the roadblocks we encountered during implementation.

Why Monitor Model Drift?

Pinterest, like many companies, uses machine learning in several areas and has seen much success with it. However, over time a model's accuracy can decrease as outside factors change. The problem we faced was how to detect these changes, which we refer to as drift.

What is model drift, actually? Let's assume Pinterest users (Pinners) are searching for clothing ideas. If the current season is winter, then coats and scarves may be trending, and the ML models will recommend pins matching winter clothing. However, once the season starts getting warmer, Pinners will be more interested in lighter clothing for spring and summer. At this point, a model that is still recommending winter clothing is no longer accurate, as the user data is shifting. This is called model drift, and the ML team should take action, for example by updating features, to correct the model output.

Many of our teams using ML have tried their own approaches to implement changes or update models. However, we want to make sure that teams can focus their efforts and resources on their actual goals and not spend too much time figuring out how to identify drift.

We decided to look at the problem from a holistic perspective and invest in finding a single solution that we can provide with Warden.

Figure 1: Comparing raw model scores (top) and downsampled model scores (bottom) shows a slight drift of the model scores over time.

As a first step to catching drift in model scores, we wanted to decide how to look at the data. We identified three different approaches to analyzing the data:

  • Comparing current data with historical data: for example, one week ago, one month ago, etc.
  • Comparing data between two different environments: for example, staging and production
  • Comparing current production data with predefined data that reflects how the model is expected to perform

In our first version of the platform, we decided to take the first approach, which compares against historical data. We made this decision because this approach provides insights into how the model changes over time, signaling that re-training may be required.

Choosing the Right Algorithm

To identify drift in model scores, we wanted to make sure we selected the right algorithm: one that would allow us to easily identify any drift in the model. After researching different algorithms, we narrowed it down to Population Stability Index (PSI) and Kullback-Leibler Divergence/Jensen-Shannon Divergence (KLD/JSD). In our first version, we decided to implement PSI, as this algorithm has also proven successful in other use cases. In the future, we plan to plug in other algorithms to expand our options.

The algorithm for PSI splits the input data into 10 buckets. A simple example is dividing a list of users by their ages. We assign each person to an age bucket, with a bucket created for each 10-year age range: 0–10 years, 11–20 years, 21–30 years, etc. For each bucket, we calculate the percentage of the data that falls into that range. Then we compare each bucket of current data with the corresponding bucket of historical data. This results in a single score for each bucket comparison, and the sum of these scores is the overall PSI score. This can be used to determine how the age of the population has changed over time.
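
For reference, the PSI formula compares the fraction of data in each bucket between the two samples. The sketch below shows the calculation in Python; the ten equal-width buckets and the small epsilon used to avoid division by zero are illustrative choices, not Warden's exact parameters.

```python
import math
from typing import Sequence


def bucket_fractions(data: Sequence[float], edges: Sequence[float]) -> list[float]:
    """Fraction of data points falling into each bucket defined by `edges`."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(edges) - 1):
            # The last bucket includes its upper edge.
            if edges[i] <= x < edges[i + 1] or (i == len(edges) - 2 and x == edges[-1]):
                counts[i] += 1
                break
    total = max(len(data), 1)
    return [c / total for c in counts]


def psi(historical: Sequence[float], current: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index: sum over buckets of (cur% - hist%) * ln(cur% / hist%)."""
    eps = 1e-6  # avoids log(0) and division by zero for empty buckets
    hist_frac = bucket_fractions(historical, edges)
    cur_frac = bucket_fractions(current, edges)
    return sum(
        (c - h) * math.log((c + eps) / (h + eps))
        for h, c in zip(hist_frac, cur_frac)
    )


# Example: ten equal-width buckets over model scores in the [0, 1] range.
edges = [i / 10 for i in range(11)]
```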

Figure 2: Input data split into 10 buckets, with the percentage of the distribution calculated for each bucket (in this example: 1%, 3%, 8%, 19%, 31%, 22%, 8%, 5%, 2%, 1%).

In our current implementation, we calculate the PSI score by comparing historical model scores with current model scores. To do this, we first determine the bucket size based on the input data. Then we calculate the bucket percentages for each timeframe, which are used to compute the PSI score. The higher the PSI score, the more drift the model is experiencing during the selected period.

The calculation is repeated every few minutes with the input window sliding forward, producing a continuous PSI score that clearly shows how the model scores are changing over time.
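
The sliding-window comparison can be sketched as follows, reusing the `psi` helper from the previous snippet. The one-week lag, three-hour window, and fetcher signature are assumptions for illustration; in Warden these would come from the job's configuration.

```python
from datetime import datetime, timedelta
from typing import Callable, Sequence

# A fetcher that returns model scores for a given time range; in Warden this
# role is played by the data-source module (e.g. a Druid query). Stubbed here.
ScoreFetcher = Callable[[datetime, datetime], Sequence[float]]


def continuous_psi(
    fetch_scores: ScoreFetcher,
    now: datetime,
    window: timedelta = timedelta(hours=3),
    lag: timedelta = timedelta(days=7),
) -> float:
    """Compare the current window of model scores against the same window one lag earlier."""
    current = fetch_scores(now - window, now)
    historical = fetch_scores(now - lag - window, now - lag)
    edges = [i / 10 for i in range(11)]  # ten equal-width buckets over [0, 1]
    return psi(historical, current, edges)  # `psi` from the previous sketch
```

Running `continuous_psi` every few minutes with an updated `now` produces the continuous series of PSI scores shown in Figure 3.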

Figure 3: The input data (top), the sliding historical and current windows (middle) used for the PSI calculation, and the resulting PSI scores over time (bottom).

Tuning the Algorithm

During the validation phase, we noticed that the size of the time window has a big impact on the usefulness of the PSI score. Choosing a window that is too small can result in very volatile PSI scores, potentially creating alerts for even small deviations. Choosing a window that is too large can mask issues in model drift. In our case, we are seeing good results with a three-hour window and a PSI calculation every 3–5 minutes. This configuration is highly dependent on the volatility of the data and the SLA requirements on drift detection.

Another thing we noticed in the calculated PSI scores was that some of the scores were higher than expected. This was especially true for model scores that do not deviate much from the expected range; for these cases we should see a PSI score of 0 or near 0.

After a deeper investigation of the input data, we found that the calculated bucket size for these instances was set to an extremely small value. Because our logic calculates bucket sizes on the fly, this happened for model scores with a very narrow data range that showed a few spikes in the data.

Figure 4: Model scores showing little to no deviation from the expected values of 0.05 to 0.10.

Logically, the PSI calculation is correct. However, in this particular use case, tiny variations of less than 0.1 are not concerning. To make the PSI scores more relevant, we implemented a configurable minimum size for buckets: a minimum of 0.1 for most cases. Results with this configuration are now more meaningful for the ML teams reviewing the data.

This configuration, however, is highly dependent on each model and on how much change is considered a deviation from the norm. In some cases a deviation of 0.001 can be quite substantial and would require much smaller bucket sizes.
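
A sketch of how bucket edges can be derived from the observed score range while enforcing a configurable minimum width is shown below. The 0.1 default mirrors the value mentioned above; the function itself is illustrative, not Warden's implementation.

```python
from typing import Sequence


def bucket_edges(
    scores: Sequence[float],
    n_buckets: int = 10,
    min_bucket_width: float = 0.1,
) -> list[float]:
    """Equal-width bucket edges over the observed range, clamped to a minimum width.

    Without the clamp, a model whose scores sit in a very narrow range
    (e.g. 0.05 to 0.10) gets extremely small buckets, and tiny spikes then
    produce large, misleading PSI scores. With the clamp, the buckets may
    extend past the observed maximum, which is harmless for the fraction
    calculation.
    """
    lo, hi = min(scores), max(scores)
    width = max((hi - lo) / n_buckets, min_bucket_width)
    return [lo + i * width for i in range(n_buckets + 1)]
```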

Figure 5: With a small bucket size, high PSI scores of 0.05 to 0.25 are seen (left). Once the minimum bucket size configuration was updated, the scores dropped to the expected range of 0 to 0.03 (right).

Now that we have implemented the historical comparison and the PSI score calculation on model scores, we are able to detect changes in model scores early and in near real time. This allows our engineers to be alerted quickly if any model drift occurs and to take action before the changes result in a production issue.

Given this early success, we are now planning to expand our use of PSI scores. We will be implementing the evaluation of feature drift as well as looking into the remaining comparison options mentioned above.

Detecting Spam

Detecting spam is the second use case for Warden. In the following section, we'll look into why we need spam detection and how we selected the Yahoo! Extensible Generic Anomaly Detection System (EGADS) library for this project.

Why is Spam Detection So Important?

Before discussing spam detection, let's look at what we define as spam and why we want to investigate it. Pinterest is a global platform with a mission to give everyone the inspiration to create a life they love. That means building a positive place that connects our global audience, over 450 million users, to personalized, actionable content: a place where they can find inspiration, plan, and shop the world's best ideas into reality.

One of our highest priorities, and a core value of Putting Pinners First, is to ensure a great experience for our users, whether they are finding their next weeknight meal inspiration, shopping for a loved one's birthday, or simply taking a wellness break. When they look for inspiration and instead find spam, it can be a big issue. Some malicious users create pins and link them to pages that are not related to the pin image. For a user, clicking on a delicious recipe image and landing on a very different page is frustrating, and we want to make sure this does not happen.

Figure 6: A pin showing a chocolate cake on the left. After clicking on the pin, the user lands on a page not related to cake.

Removing spammy pins is one part of the solution, but how do we prevent this from happening again? We don't just want to remove the symptom, which is the bad content; we want to remove the source of the problem and make sure we identify malicious users to stop them from continuing to create spam.

How Can We Identify Spam?

Detecting malicious users and spam is essential for any business today, but it can be very difficult. Identifying newly created spam users can be especially tedious and time consuming. The behavior of spam users is not always clearly distinguishable, and spammer behavior and tactics evolve over time to evade detection.

Before our Warden anomaly detection platform was available, identifying spam required our Trust and Safety team to manually run queries, review and evaluate the data, and then trigger interventions for any suspicious occurrences.

So how do we know when spam is being created? Generally, malicious users don't create just a single spam pin. To make money, they want to create a large number of spam pins at a time and widen their net. This helps us identify these users. Looking at pin creation, for example, we know to expect something like a sine wave in the number of pins created per day or week: users create pins during the day, and fewer pins are created at night. We also know there may be some variation depending on the day of the week.

Figure 7: Sample curve of created pins over seven days, showing a near sine wave with some daily variation.

The overall graph of the count of created pins shows a similar pattern repeating on a daily and weekly basis. Identifying spam or an increase in pin creation there is very difficult, as spam is still a small percentage compared to the full set of data.

To get a finer-grained picture, we drilled down into further details and filtered by specific parameters. These parameters included filters like the internet service provider (ISP) used, country of origin, event type (creation of pins, etc.), and many other options. This allowed us to look at smaller and smaller datasets in which spikes are clearer and more easily identifiable.
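
As a simple illustration of this drill-down, the sketch below counts pin-creation events per hour for each (ISP, country) segment; the pandas-based approach and column names are assumptions for the example rather than Warden's actual query layer.

```python
import pandas as pd


def hourly_counts_by_segment(events: pd.DataFrame) -> pd.DataFrame:
    """Count pin-creation events per hour for each (ISP, country) segment.

    Expects columns: 'timestamp' (datetime64), 'isp', 'country', 'event_type'.
    Narrow segments make spikes stand out that are invisible in the
    aggregated creation curve.
    """
    creations = events[events["event_type"] == "pin_create"]
    return (
        creations
        .groupby(["isp", "country", pd.Grouper(key="timestamp", freq="1h")])
        .size()
        .rename("pin_count")
        .reset_index()
    )
```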

With the knowledge gained about what normal user data without spam should look like, we moved forward and evaluated our anomaly detection options against the following:

  1. Data is expected to follow a similar pattern over time
  2. We can filter the data to get better insights
  3. We want to know about any spikes in the data as potential spam

Implementation of the Spam Detection System

We started by looking at several frameworks that are available and already support much of the functionality we were looking for. After comparing several of the options, we decided to go ahead with the Yahoo! EGADS framework [https://github.com/yahoo/egads].

This framework analyzes the data in two steps. The Tuning Process reads historical data and determines what data is expected in the future. Detection is the second step, in which the actual data is compared to the expectation and any outliers exceeding a defined threshold are marked as anomalies.
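
The snippet below is not the EGADS API; it is a simplified illustration of the same two-step pattern. A tuning step learns the expected value for each hour of the week from historical data, and a detection step flags points whose relative deviation from that expectation exceeds a threshold.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple


def tune(history: List[Tuple[datetime, float]]) -> Dict[int, float]:
    """Learn the expected value for each hour-of-week from historical data."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for ts, value in history:
        hour_of_week = ts.weekday() * 24 + ts.hour
        sums[hour_of_week] += value
        counts[hour_of_week] += 1
    return {h: sums[h] / counts[h] for h in sums}


def detect(
    expected: Dict[int, float],
    current: List[Tuple[datetime, float]],
    threshold: float = 0.5,
) -> List[Tuple[datetime, float]]:
    """Flag points that deviate from the expectation by more than `threshold` (relative)."""
    anomalies = []
    for ts, value in current:
        baseline = expected.get(ts.weekday() * 24 + ts.hour)
        # Skip hours with no baseline (or a zero baseline) rather than divide by zero.
        if baseline and abs(value - baseline) / baseline > threshold:
            anomalies.append((ts, value))
    return anomalies
```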

So, how are we using this library within our Warden anomaly detection platform? To detect anomalies, we need to go through several phases.

In the first phase we provide all the configuration needed for the task. This includes details about the source of the input data, which anomaly detection algorithms to use, the parameters to be used during the detection step, and finally how to handle the results.
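
As a hypothetical example, such a job configuration could look like the following; the field names and values are assumptions for illustration, not Warden's actual schema.

```python
# Hypothetical configuration for one Warden detection job (illustrative only).
spam_detection_job = {
    "input": {
        "source": "druid",                    # which data-source connector to use
        "datasource": "pin_creation_events",  # dataset to query
        "granularity": "1h",                  # aggregation window for the query
        "filters": {"event_type": "pin_create"},
    },
    "detection": {
        "algorithm": "egads",                 # anomaly detection module to run
        "parameters": {"threshold": 3.0},     # algorithm-specific settings
    },
    "notification": {
        "channels": ["email", "slack"],       # where alerts are delivered
        "recipients": ["trust-and-safety"],   # consuming team / system
    },
}
```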

With the configuration in place, Warden begins by connecting to the data source and querying the input data. Thanks to the modular approach, we are able to plug in different sources and add additional connectors whenever needed. Our first version of Warden focused on reading data from our Apache Druid cluster. Since that data is real-time data and already grouped by timestamps, it lends itself to anomaly detection very easily. For later projects, we have also added a Presto connector to support new use cases.

Once the data is queried from the data source, it is transformed into the format required for the Tuning/Detection phase. Feeding the data into the EGADS Time Series Modeling Module (TM) triggers the Tuning step, which is followed by the Detection step, where one or more Anomaly Detection Models (ADM) identify any outliers.

Choosing the Time Series Module depends on the type of input data. Similarly, deciding which Anomaly Detection Model to use depends on the type of outliers we want to detect. If you are looking for more details on this and on EGADS, please refer to the GitHub page.

After retrieving the results and identifying any suspicious outliers, we can proceed to look further into the data. The initial step applies broader filtering, like identifying any spikes per ISP, origin country, etc. In further steps, we take the insights gained from the first step and filter on additional features. At this point, we can ignore any datasets that don't show any concerns and concentrate on suspicious data to identify malicious users or confirm that all actions are valid.

Figure 8: Analyzing pin creation data with broad filters surfaces outliers; drilling deeper brings anomalies to light.

Once we have gathered enough details on the data, we proceed to our last phase, the notification phase. At this stage, we notify any subscribers of potential anomalies. Details are provided via email, Slack, and other channels to inform our Trust and Safety team so they can take action, such as deactivating or blocking users.

With the Warden anomaly detection platform, we have been able to improve Pinterest's spam detection efforts, significantly impacting the number of malicious users identified and how quickly we are able to detect them. This has been a great improvement compared to manual investigations.

Our Trust & Safety teams have appreciated using Warden and are planning to expand their use cases.

"One of the most important things we need for identifying spammers is to accurately segment features and time periods before we do any clustering or measurement. Warden enabled us to get alerted early and find the most important segment to run our algorithms on." — Trust & Safety Team

Future

Being able to detect anomalies with Warden has enabled us to support our Trust and Safety team and allows us to detect drift in our ML models very quickly. This has improved the user experience and supported our engineering teams. The teams continue to evaluate spam and spam patterns, allowing us to evolve the detection and broaden the underlying data.

In the future, we plan to expand the use of anomaly detection to get alerted early about any changes in the Pinterest system before actual issues occur. Another use case we plan to add to the platform is root cause analysis. This will be applied to current and historical data, enabling our teams to reduce the time spent pinpointing the cause of an issue and to concentrate on quickly addressing it.

Acknowledgements

Many thanks to our partner teams and their engineers (Cathy Yang | Trust & Safety; Howard Nguyen | MLS; Li Tang | MLS) who have been working with us on these projects and for all their support!

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore life at Pinterest, visit our Careers page.
