Biases in Recommender Systems: Top Challenges and Recent Breakthroughs


Image generated by the author with Midjourney

Recommender systems have become ubiquitous in our daily lives, from online shopping to social media to entertainment platforms. These systems use complex algorithms to analyze historical user engagement data and make recommendations based on users’ inferred preferences and behaviors.

While these systems can be incredibly useful in helping users discover new content or products, they aren’t without their flaws: recommender systems are plagued by various forms of bias that can lead to poor recommendations and therefore a poor user experience. One of today’s main research threads around recommender systems is therefore how to de-bias them.

In this article, we’ll dive into 5 of the most prevalent biases in recommender systems, and learn about some of the recent research from Google, YouTube, Netflix, Kuaishou, and others.

Let’s start.

1 — Clickbait bias

Wherever there’s an entertainment platform, there’s clickbait: sensational or misleading headlines or video thumbnails designed to grab a user’s attention and entice them to click, without providing any real value. “You won’t believe what happened next!”

If we train a ranking model using clicks as positives, that model will naturally be biased in favor of clickbait. This is bad, because such a model would promote even more clickbait to users, and therefore amplify the damage it does.

One solution for de-biasing ranking models from clickbait, proposed by Covington et al (2016) in the context of YouTube video recommendations, is weighted logistic regression, where the weights are the watch time for positive training examples (impressions with clicks), and unity for negative training examples (impressions without clicks).

Mathematically, it can be shown that such a weighted logistic regression model learns odds that are approximately the expected watch time for a video. At serving time, videos are ranked by their predicted odds, so that videos with long expected watch times are placed at the top of the recommendations, and clickbait (with the lowest expected watch times) at the bottom.
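To see why the learned odds approximate expected watch time, consider the simplest possible case: a model with only an intercept, fit to the impressions of a single video. For such a model, the weighted maximum-likelihood odds are the total positive weight divided by the total negative weight. The sketch below works under that assumption; the function names and the toy numbers are illustrative and are not taken from Covington et al.

```python
def weighted_odds(watch_times):
    """Odds learned by an intercept-only weighted logistic regression.

    watch_times holds one entry per impression: the watch time in seconds
    if the impression was clicked, or 0 if it was not. Clicked impressions
    are positives weighted by watch time; the rest are negatives with weight 1.
    """
    positive_weight = sum(t for t in watch_times if t > 0)
    negative_weight = sum(1 for t in watch_times if t == 0)
    return positive_weight / negative_weight


def expected_watch_time(watch_times):
    """Average watch time per impression, counting non-clicks as 0 seconds."""
    return sum(watch_times) / len(watch_times)


# Toy data: 1,000 impressions, 20 of them clicked and watched for 300 s each.
impressions = [300.0] * 20 + [0.0] * 980

odds = weighted_odds(impressions)            # 6000 / 980
expected = expected_watch_time(impressions)  # 6000 / 1000
print(f"odds={odds:.2f}, expected watch time={expected:.2f}s")
```

In this setup the two quantities differ by a factor of 1 / (1 - p), where p is the click rate, so they nearly coincide whenever clicks are rare.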

Unfortunately, Covington et al don’t share all of their experimental results, but they do say that weighted logistic regression performs “a lot better” than predicting clicks directly.

2 — Duration bias

Weighted logistic regression works well for solving the clickbait problem, but it introduces a new problem: duration bias. Simply put, longer videos tend to be watched for a longer time, not necessarily because they’re more relevant, but simply because they’re longer.

Think about a video catalog that contains 10-second short-form videos along with 2-hour long-form videos. A watch time of 10 seconds means something completely different in the two cases: it’s a strong positive signal in the former, and a weak positive (perhaps even a negative) signal in the latter. Yet, the Covington approach would not be able to distinguish between these two cases, and would bias the model in favor of long-form videos (which generate longer watch times simply because they’re longer).

A solution to duration bias, proposed by Zhan et al (2022) from Kuaishou, is quantile-based watch-time prediction.

The key idea is to bucket all videos into duration quantiles, and then bucket all watch times within a duration bucket into quantiles as well. For example, with 10 quantiles, such an assignment could look like this:

(training example 1)
video duration = 120min --> video quantile 10
watch duration = 10s --> watch quantile 1

(training example 2)
video duration = 10s --> video quantile 1
watch duration = 10s --> watch quantile 10

By translating all time intervals into quantiles, the model understands that 10s is “high” in the latter example, but “low” in the former, or so the authors hypothesize. At training time, we provide the model with the video quantile, and task it with predicting the watch quantile. At inference time, we simply rank all videos by their predicted watch time, which is now de-confounded from the video duration itself.
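The two-level bucketing described above can be sketched in a few lines. This is a minimal illustration and not the authors’ implementation: the helper names are hypothetical, the toy catalog is made up, and it uses 2 quantiles instead of 10 so that the data stays small.

```python
from bisect import bisect_right
from collections import defaultdict


def quantile_edges(values, n_buckets):
    """Cut points that split the values into n_buckets roughly equal groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_buckets] for k in range(1, n_buckets)]


def to_bucket(value, edges):
    """1-based bucket index of value, given sorted cut points."""
    return bisect_right(edges, value) + 1


def quantile_labels(examples, n_buckets):
    """Map (video_duration, watch_time) pairs to (video quantile, watch quantile).

    Watch-time quantiles are computed separately inside each duration bucket,
    so the label says how long a watch was relative to videos of similar length.
    """
    duration_edges = quantile_edges([d for d, _ in examples], n_buckets)

    # Group watch times by the duration bucket of their video.
    watch_by_bucket = defaultdict(list)
    for duration, watch in examples:
        watch_by_bucket[to_bucket(duration, duration_edges)].append(watch)
    watch_edges = {
        bucket: quantile_edges(watches, n_buckets)
        for bucket, watches in watch_by_bucket.items()
    }

    labels = []
    for duration, watch in examples:
        bucket = to_bucket(duration, duration_edges)
        labels.append((bucket, to_bucket(watch, watch_edges[bucket])))
    return labels


# Toy catalog (durations and watch times in seconds).
short_videos = [(10, w) for w in range(1, 11)]  # 10 s videos watched 1..10 s
long_videos = [(7200, w) for w in [10] + list(range(600, 6000, 600))]
labels = quantile_labels(short_videos + long_videos, n_buckets=2)

print(labels[9])   # 10 s video watched for 10 s
print(labels[10])  # 2 h video watched for 10 s
```

The same 10-second watch lands in the top watch quantile for the short video and in the bottom one for the long video, which is the distinction the raw watch time cannot express.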

And indeed, this approach appears to work. Using A/B testing, the authors report

  • 0.5% improvements in total watch time compared to weighted logistic regression (the idea from Covington et al), and
  • 0.75% improvements in total watch time in comparison with predicting watch time directly.

The results show that removing duration bias can be a powerful approach on platforms that serve both long-form and short-form videos. Perhaps counter-intuitively, removing bias in favor of long videos actually improves overall user watch times.

3 — Position bias

Position bias means that the highest-ranked items are the ones that create the most engagement not because they’re actually the best content for the user, but simply because they’re ranked highest, and users start to blindly trust the ranking they’re being shown. The model predictions become a self-fulfilling prophecy, but this is not what we really want. We want to predict what users want, not make them want what we predict.

Position bias can be mitigated by techniques such as rank randomization, intervention harvesting, or using the ranks themselves as features, which I covered in my other post here.

Particularly problematic is that position bias will always make our models look better on paper than they actually are. Our models may be slowly degrading in quality, but we wouldn’t know what is happening until it’s too late (and users have churned away). It’s therefore important, when working with recommender systems, to monitor multiple quality metrics about the system, including metrics that quantify user retention and the diversity of recommendations.
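Of the three techniques, using the rank as a feature is the easiest to sketch. One common recipe, assumed here rather than described in this article, is to feed the position at which an item was shown into the model at training time, and to set that feature to the same fixed value for every candidate at serving time, so that it cannot influence the ordering. The function names, the feature names, and the default position of 1 are all hypothetical.

```python
DEFAULT_POSITION = 1  # assumed constant used for every candidate at serving time


def training_features(item_features, shown_position):
    """Features for a logged impression: item features plus the shown position."""
    return {**item_features, "position": shown_position}


def serving_features(item_features):
    """Features for a candidate at serving time: position is held constant."""
    return {**item_features, "position": DEFAULT_POSITION}


# A logged impression that was shown in slot 7 and is used for training.
logged = training_features({"video_id": "a", "topic": "cooking"}, shown_position=7)

# Two candidates being scored for a new request.
candidates = [
    serving_features({"video_id": "b", "topic": "travel"}),
    serving_features({"video_id": "c", "topic": "music"}),
]
print(logged)
print(candidates)
```

Because every candidate carries the same position value at serving time, any difference in their scores has to come from the remaining features.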

4 — Popularity bias

Popularity bias refers to the tendency of the model to give higher rankings to items that are more popular overall (due to the fact that they’ve been rated by more users), rather than ranking items by their actual quality or relevance for a particular user. This can lead to a distorted ranking, where less popular or niche items that could be a better fit for the user’s preferences aren’t given adequate consideration.

Yi et al (2019) from Google propose a simple but effective algorithmic tweak to de-bias a video recommendation model from popularity bias. During model training, they replace the logits in their logistic regression layer as follows:

logit(u,v) <-- logit(u,v) - log(P(v))


where

  • logit(u,v) is the logit function (i.e., the log-odds) for user u engaging with video v, and
  • log(P(v)) is the log-frequency of video v.

Of course, the right-hand side is equivalent to:

log[ odds(u,v)/P(v) ]

In other words, they simply normalize the predicted odds for a user/video pair by the video probability. Extremely high odds from popular videos count as much as moderately high odds from not-so-popular videos. And that’s all the magic.

And indeed, the magic appears to work: in online A/B tests, the authors find a 0.37% improvement in overall user engagement with the de-biased ranking model.
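The logit correction is small enough to check by hand. The sketch below applies it to two made-up videos; the function name and all numbers are hypothetical and only serve to illustrate the normalization.

```python
import math


def debiased_logit(logit, video_probability):
    """Apply the correction logit(u,v) - log(P(v))."""
    return logit - math.log(video_probability)


# Hypothetical numbers: a popular video with very high predicted odds and
# a niche video with moderate predicted odds.
popular = debiased_logit(logit=math.log(10.0), video_probability=0.1)
niche = debiased_logit(logit=math.log(1.0), video_probability=0.01)

# Both corrected logits equal log(odds / P(v)) = log(100).
print(popular, niche)
```

Before the correction, the popular video’s odds are ten times higher; after dividing by the video probability, the two videos are tied.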

5 — Single-interest bias

Suppose you watch mostly drama movies, but sometimes you like to watch a comedy, and occasionally a documentary. You have multiple interests, yet a ranking model trained to maximize your watch time may over-emphasize drama movies because that’s what you’re most likely to engage with. This is single-interest bias, the failure of a model to understand that users inherently have multiple interests and preferences.

In order to remove single-interest bias, a ranking model needs to be calibrated. Calibration simply means that, if you watch drama movies 80% of the time, then the model’s top 100 recommendations should in fact include around 80 drama movies (and not 100).

Netflix’s Harald Steck (2018) demonstrates the benefits of model calibration with a simple post-processing technique that re-ranks the model’s output. He presents experimental results that demonstrate the effectiveness of the method in improving the calibration of Netflix recommendations, which he quantifies with KL divergence scores. The resulting movie recommendations are more diverse (in fact, as diverse as the actual user preferences) and lead to improved overall watch times.
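To make the KL divergence score concrete, the sketch below compares the genre mix of a user’s watch history with the genre mix of a recommendation list. It is a minimal illustration, not the paper’s code: the function names, the one-genre-per-movie simplification, and the smoothing constant alpha (used only to avoid dividing by zero when a genre is missing from the recommendations) are assumptions.

```python
import math
from collections import Counter


def genre_distribution(genres):
    """Fraction of items per genre."""
    counts = Counter(genres)
    total = sum(counts.values())
    return {genre: count / total for genre, count in counts.items()}


def calibration_kl(history_genres, recommended_genres, alpha=0.01):
    """KL divergence from the history distribution to the recommended one.

    0 means the recommendations mirror the user's history; larger values
    mean the list is less calibrated.
    """
    p = genre_distribution(history_genres)
    q = genre_distribution(recommended_genres)
    kl = 0.0
    for genre, p_g in p.items():
        # Mix a little of p into q so genres absent from q stay finite.
        q_g = (1 - alpha) * q.get(genre, 0.0) + alpha * p_g
        kl += p_g * math.log(p_g / q_g)
    return kl


# Watch history: 80% drama, 15% comedy, 5% documentary.
history = ["drama"] * 80 + ["comedy"] * 15 + ["documentary"] * 5

calibrated_score = calibration_kl(history, history)         # same genre mix
all_drama_score = calibration_kl(history, ["drama"] * 100)  # single interest
print(calibrated_score, all_drama_score)
```

A list that mirrors the 80/15/5 split scores essentially zero, while the all-drama list scores clearly higher, which is the gap a calibration step tries to close.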

