Biases in Recommender Systems: Top Challenges and Recent Breakthroughs


Image generated by the author with Midjourney

Recommender systems have become ubiquitous in our daily lives, from online shopping to social media to entertainment platforms. These systems use complex algorithms to analyze historical user engagement data and make recommendations based on users' inferred preferences and behaviors.

While these systems can be incredibly useful in helping users discover new content or products, they are not without their flaws: recommender systems are plagued by various types of bias that can result in poor recommendations and, in turn, a poor user experience. One of today's most important research threads around recommender systems is therefore how to de-bias them.

In this article, we'll dive into 5 of the most prevalent biases in recommender systems, and look at some of the recent research from Google, YouTube, Netflix, Kuaishou, and others.

Let’s start.

1 — Clickbait bias

Wherever there's an entertainment platform, there's clickbait: sensational or misleading headlines or video thumbnails designed to grab a user's attention and entice them to click, without providing any real value. "You won't believe what happened next!"

If we train a ranking model using clicks as positives, that model will naturally be biased in favor of clickbait. This is bad, because such a model would promote even more clickbait to users, and thereby amplify the damage it does.

One solution for de-biasing ranking models from clickbait, proposed by Covington et al (2016) in the context of YouTube video recommendations, is weighted logistic regression, where the weights are the watch time for positive training examples (impressions with clicks) and unity for negative training examples (impressions without clicks).

Mathematically, it can be shown that such a weighted logistic regression model learns odds that approximate the expected watch time for a video: the learned odds are the sum of the positive weights (the watch times) divided by the number of negatives, which, when click probabilities are small, is roughly the expected watch time per impression. At serving time, videos are ranked by their predicted odds, which places videos with long expected watch times at the top of the recommendations and clickbait (with the lowest expected watch times) at the bottom.
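To make this concrete, here is a minimal sketch of watch-time-weighted logistic regression on synthetic data, using scikit-learn. The features, click labels, and watch times are made up for illustration; the actual YouTube model is a deep neural network with a weighted logistic output layer, not a plain logistic regression.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 10_000, 8
X = rng.normal(size=(n, d))                            # impression features (user/video)
clicked = rng.binomial(1, 0.1, size=n)                 # 1 for impressions with clicks, 0 otherwise
watch_time = clicked * rng.exponential(60.0, size=n)   # watch time in seconds (0 if not clicked)

# Positive examples are weighted by their watch time, negatives get unit weight.
sample_weight = np.where(clicked == 1, np.maximum(watch_time, 1e-3), 1.0)

model = LogisticRegression()
model.fit(X, clicked, sample_weight=sample_weight)

# At serving time, rank candidates by predicted odds p / (1 - p),
# which roughly correspond to expected watch time per impression.
p = model.predict_proba(X)[:, 1]
odds = p / (1.0 - p)
ranking = np.argsort(-odds)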

Unfortunately, Covington et al don't share all of their experimental results, but they do say that weighted logistic regression performs "significantly better" than predicting clicks directly.

2 — Duration bias

Weighted logistic regression works well for solving the clickbait problem, but it introduces a new problem: duration bias. Simply put, longer videos tend to be watched for a longer time, not necessarily because they're more relevant, but simply because they're longer.

Consider a video catalog that contains 10-second short-form videos alongside 2-hour long-form videos. A watch time of 10 seconds means something completely different in the two cases: it's a strong positive signal for the former and a weak positive (perhaps even a negative) signal for the latter. Yet the Covington approach wouldn't be able to distinguish between these two cases, and would bias the model in favor of long-form videos (which generate longer watch times simply because they're longer).

A solution to duration bias, proposed by Zhan et al (2022) from Kuaishou, is to predict watch-time quantiles conditioned on video duration, instead of raw watch times.

The key idea is to bucket all videos into duration quantiles, and then bucket all watch times within each duration bucket into quantiles as well. For example, with 10 quantiles, such an assignment could look like this:

(training example 1)
video duration = 120min --> video quantile 10
watch duration = 10s --> watch quantile 1

(training example 2)
video duration = 10s --> video quantile 1
watch duration = 10s --> watch quantile 10
...

By translating all time intervals into quantiles, the model understands that 10s is "high" in the latter example but "low" in the former, or so the authors hypothesize. At training time, we provide the model with the video's duration quantile and task it with predicting the watch-time quantile. At inference time, we simply rank all videos by their predicted watch-time quantile, which is now de-confounded from the video duration itself.
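Here is a minimal sketch of that quantile assignment step with pandas on synthetic data; the column names and the choice of 10 buckets are illustrative, not taken from the paper.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
video_duration = rng.uniform(10, 7200, size=n)            # video length in seconds
watch_time = rng.uniform(0, 1, size=n) * video_duration   # watched fraction of each video
df = pd.DataFrame({"video_duration": video_duration, "watch_time": watch_time})

# Step 1: bucket videos into duration quantiles (10 buckets, labeled 1-10).
df["duration_quantile"] = pd.qcut(df["video_duration"], q=10, labels=False) + 1

# Step 2: within each duration bucket, bucket watch times into quantiles as well.
df["watch_quantile"] = (
    df.groupby("duration_quantile")["watch_time"]
      .transform(lambda s: pd.qcut(s, q=10, labels=False, duplicates="drop") + 1)
)

# The model is then trained to predict watch_quantile given the duration quantile;
# at serving time, videos are ranked by that prediction.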

And indeed, this approach appears to work. Using A/B testing, the authors report

  • a 0.5% improvement in total watch time compared to weighted logistic regression (the approach from Covington et al), and
  • a 0.75% improvement in total watch time compared to predicting watch time directly.

These results show that removing duration bias can be a powerful lever on platforms that serve both long-form and short-form videos. Perhaps counter-intuitively, removing the bias in favor of long videos in fact improves overall user watch times.

3 — Position bias

Position bias means that the highest-ranked items generate the most engagement not because they're actually the best content for the user, but simply because they're ranked highest and users tend to blindly trust the ranking they're shown. The model's predictions become a self-fulfilling prophecy, but this is not what we really want: we want to predict what users want, not make them want what we predict.

Position bias can be mitigated by techniques such as rank randomization, intervention harvesting, or using the ranks themselves as features, which I covered in my other post here.
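As an illustration of the last idea, here is a minimal sketch of using the logged position as a training feature and a fixed position at serving time; the model choice and the synthetic data are arbitrary and only meant to show the mechanics.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 20_000
features = rng.normal(size=(n, 5))            # user/item features
position = rng.integers(1, 21, size=n)        # rank at which the item was shown (1-20)
clicked = rng.binomial(1, 1.0 / position)     # clicks fall off sharply with position

# Train with the logged position as an extra feature, so the model can
# attribute part of the engagement to the position rather than the item.
X_train = np.column_stack([features, position])
model = GradientBoostingClassifier().fit(X_train, clicked)

# At serving time, score every candidate at the same fixed position (here: 1),
# so the score reflects relevance rather than where the item happened to be shown.
X_serve = np.column_stack([features, np.ones(n)])
scores = model.predict_proba(X_serve)[:, 1]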

Particularly problematic is that position bias will always make our models look better on paper than they really are. Our models may be slowly degrading in quality, but we wouldn't know what is happening until it's too late (and users have churned away). It is therefore important, when working with recommender systems, to monitor multiple quality metrics for the system, including metrics that quantify user retention and the diversity of recommendations.

4 — Popularity bias

Popularity bias refers to the tendency of a model to give higher rankings to items that are more popular overall (due to the fact that they have been rated by more users), rather than ranking them by their actual quality or relevance for a particular user. This can result in a distorted ranking, where less popular or niche items that might be a better fit for the user's preferences are not given adequate consideration.

Yi et al (2019) from Google propose a simple but effective algorithmic tweak to de-bias a video recommendation model from popularity bias. During model training, they replace the logits of their logistic regression layer as follows:

logit(u,v) <-- logit(u,v) - log(P(v))

where

  • logit(u,v) is the logit function (i.e., the log-odds) for user u engaging with video v, and
  • log(P(v)) is the log-frequency of video v.

Of course, the right-hand side is equivalent to:

log[ odds(u,v)/P(v) ]

In other words, they simply normalize the predicted odds for a user/video pair by the video's popularity. Extremely high odds for popular videos then count about as much as moderately high odds for not-so-popular videos. And that's the whole magic.
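In code, the correction is a one-liner. Here is a minimal sketch with numpy, assuming we already have raw logits for one user and an empirical frequency estimate P(v) for each candidate video; all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_videos = 1_000
logits = rng.normal(size=n_videos)                  # raw logit(u, v) for one user over all candidates
video_counts = rng.integers(1, 10_000, size=n_videos)
p_video = video_counts / video_counts.sum()         # empirical frequency P(v) of each video

# Subtracting log P(v) from the logit divides the odds by the video's popularity.
corrected_logits = logits - np.log(p_video)

ranking = np.argsort(-corrected_logits)             # rank candidates by the popularity-corrected score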

And indeed, the magic appears to work: in online A/B tests, the authors find a 0.37% improvement in overall user engagement with the de-biased ranking model.

5 — Single-interest bias

Suppose you watch mostly drama movies, but sometimes you like to watch a comedy, and on occasion a documentary. You have multiple interests, yet a ranking model trained to maximize your watch time may over-emphasize drama movies because that's what you're most likely to engage with. This is single-interest bias, the failure of a model to understand that users inherently have multiple interests and preferences.

In order to remove single-interest bias, a ranking model needs to be calibrated. Calibration simply means that, if you watch drama movies 80% of the time, then the model's top 100 recommendations should in fact include around 80 drama movies (and not 100).

Netflix's Harald Steck (2018) demonstrates the benefits of model calibration with a simple post-processing technique called Platt scaling. He presents experimental results that show the effectiveness of the method in improving the calibration of Netflix recommendations, which he quantifies with KL divergence scores. The resulting movie recommendations are more diverse (in fact, as diverse as the actual user preferences) and result in improved overall watch times.
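To make the metric concrete, here is a minimal sketch of scoring calibration as the KL divergence between a user's historical genre distribution p and the genre distribution q of the recommended list. The genres and numbers are made up, and the smoothing of q is just one simple way to keep the divergence finite.

import numpy as np

genres = ["drama", "comedy", "documentary"]
p = np.array([0.80, 0.15, 0.05])     # genre shares in the user's watch history

recommended = ["drama"] * 95 + ["comedy"] * 5                    # genres of the top-100 recommendations
q = np.array([recommended.count(g) for g in genres]) / len(recommended)

alpha = 0.01
q = (1 - alpha) * q + alpha * p      # smooth q toward p so the KL divergence stays finite

kl = np.sum(p * np.log(p / q))       # lower = better calibrated
print(f"Calibration KL(p || q) = {kl:.3f}")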
