Building a Recommender System using Machine Learning
Problem Statement: Multi-Objective Recommender System
How to Approach a RecSys for a Large Database of Items
Stage 1: Candidate Generation with Co-Visitation Matrix
Stage 2: Reranking with GBDT Model
Summary
Enjoyed This Story?
References

“Candidate rerank” approach with co-visitation matrix and GBDT ranker model in Python

“A wonderful selection, madam! Our burger pairs perfectly with a side and a drink. May I suggest some options?” (Image by the author)

Welcome to the first edition of a new article series called “The Kaggle Blueprints”, where we will analyze Kaggle competitions’ top solutions for lessons we can apply to our own data science projects.

This first edition will review the techniques and approaches from the “OTTO — Multi-Objective Recommender System” competition, which ended at the end of January 2023.

Problem Statement: Multi-Objective Recommender System

The goal of the “OTTO — Multi-Objective Recommender System” competition was to build a multi-objective recommender system based on a large dataset of implicit user data.

Specifically, within this e-commerce use case, competitors were dealing with the following details:

  • multi-objective: clicks, cart additions, and orders
  • large dataset: over 200 million events for about 1.8 million items
  • implicit user data: previous events in a user session

How to Approach a RecSys for a Large Database of Items

One of the main challenges of this competition was the large number of items to choose from. Feeding all of the available information into a complex model would require extensive computational resources.

Thus, the general baseline most competitors of this challenge followed is the two-stage candidate generation/reranking approach [3]:

  1. Stage: candidate generation — This step reduces the number of potential recommendations (candidates) for each user from millions to about 50 to 200 [2]. To handle the amount of data, a simple model is usually used for this step.
  2. Stage: reranking — You can use a more complex model for this step, such as a Machine Learning (ML) model. Once you have ranked your reduced candidates, you can select the highest-ranked items as recommendations.
Two-stage recommender system: candidate generation/rerank technique (Image by the author, inspired by [3])

Stage 1: Candidate Generation with Co-Visitation Matrix

The first step of the two-stage approach is to reduce the number of potential recommendations (candidates) from millions to about 50 to 200 [2]. To deal with the large number of items, the first model should be simple [5].

You can select and combine different strategies to reduce the number of items [3]:

  • by user history
  • by popularity — this strategy can also serve as a strong baseline [5]
  • by co-occurrence based on a co-visitation matrix

The most straightforward approach to generate candidates is to use the user’s history: if a user has viewed an item, they are likely to purchase it as well.

However, if the user has viewed fewer items (e.g., five items) than the number of candidates we want to generate per user (e.g., 50 to 200), we can populate the list of candidates by item popularity or co-occurrence [7]. Since selection by popularity is simple, we will focus on candidate generation by co-occurrence in this section.
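As a small illustration of this padding logic, here is a minimal sketch. The helper generate_candidates and the inputs session_items and popular_items are hypothetical names, not from the competition code:

def generate_candidates(session_items, popular_items, n_candidates=50):
    """Start from the user's history and pad with popular items."""
    # Deduplicate the user's viewed items while keeping their order.
    candidates = list(dict.fromkeys(session_items))
    # Pad with globally popular items until we reach n_candidates.
    for item in popular_items:
        if len(candidates) >= n_candidates:
            break
        if item not in candidates:
            candidates.append(item)
    return candidates[:n_candidates]

# Example: a short session padded with the most popular items.
print(generate_candidates([3, 7, 3], [1, 2, 3, 4, 5], n_candidates=5))
# -> [3, 7, 1, 2, 4]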

The co-occurrence of two items can be approached with a co-visitation matrix: if user_1 bought item_a and shortly after item_b, we store this information [6, 7].

Minimal example of users’ buying behavior for a recommender system (Image by the author)
  1. For each item, count the occurrences of every other item within a specified time frame.
Minimal example of a co-visitation matrix (Image by the author)

2. For each item, find the 50 to 200 most frequent items visited after this item.

As you can see from the image above, a co-visitation matrix is not necessarily symmetric. For example, someone who bought a burger is also likely to buy a drink, but the opposite may not be true.

You can also assign weights to the co-visitation matrix based on proximity. For example, items bought together in the same session could have a higher weight than items a user bought across different shopping sessions.
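To make the two counting steps and the proximity weighting concrete, below is a minimal sketch with pandas. The dataframe layout (columns session, aid, ts) and the weighting function are illustrative assumptions, not the competition’s exact scheme:

import pandas as pd

# Hypothetical event log: one row per event, with a session id,
# an item id ("aid"), and a Unix timestamp ("ts").
events = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid":     [10, 11, 12, 10, 12],
    "ts":      [100, 160, 220, 500, 530],
})

# Step 1: pair each event with later events in the same session,
# keeping only pairs within a specified time frame (here: 1 day).
pairs = events.merge(events, on="session", suffixes=("", "_next"))
pairs = pairs[(pairs["aid"] != pairs["aid_next"])
              & (pairs["ts_next"] - pairs["ts"]).between(0, 24 * 60 * 60)]

# Optional proximity weighting: pairs that occur closer in time
# contribute more (an illustrative choice of weighting function).
pairs["wgt"] = 1.0 / (1.0 + (pairs["ts_next"] - pairs["ts"]) / 3600)

# Aggregate into the co-visitation matrix (aid -> aid_next weights).
covisit = pairs.groupby(["aid", "aid_next"])["wgt"].sum().reset_index()

# Step 2: for each item, keep the N most co-visited follow-up items.
N = 2  # use 50 to 200 in practice
top_n = (covisit.sort_values(["aid", "wgt"], ascending=[True, False])
                .groupby("aid")
                .head(N))
print(top_n)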

The co-visitation matrix resembles doing matrix factorization by counting co-occurrences [6]. Matrix factorization is a popular technique for recommender systems. Specifically, it is a collaborative filtering method that finds the relationship between items and users.

Stage 2: Reranking with GBDT Model

The second step is reranking. While you can achieve good performance with handcrafted rules [1], in theory, an ML model should work better [5].

You can use different Gradient Boosted Decision Tree (GBDT) rankers, such as XGBRanker or LGBMRanker [2, 3, 4].

Preparation of training data and feature engineering

The training data for the GBDT ranker model should contain the following column categories [2]:

  • user-item pairs — The base for the dataframe will be the list of candidates generated in the first stage. For each user, you should end up with N_CANDIDATES candidates, and thus, the starting point should be a dataframe of shape (N_USERS * N_CANDIDATES, 2).
  • item features — counts, aggregation features, ratio features, etc.
  • user features — counts, aggregation features, ratio features, etc.
  • user-item features (optional) — You can create user-item interaction features, such as ‘item clicked’.
  • labels — For each user-item pair, merge the labels (e.g., ‘bought’ or ‘not bought’).

The resulting training dataframe should look something like this:

Training data structure for training a GBDT ranker model for a recommender system (Image by the author)
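As a rough illustration of how such a dataframe could be assembled, here is a minimal sketch. All table and column names (candidates, item_features, user_features, labels, bought) are hypothetical:

import pandas as pd

# Hypothetical stage-1 output: N_CANDIDATES candidate items per user.
candidates = pd.DataFrame({"user": [1, 1, 2, 2],
                           "item": [10, 12, 10, 11]})

# Hypothetical feature tables (counts, aggregations, ratios, ...).
item_features = pd.DataFrame({"item": [10, 11, 12],
                              "item_clicks": [5, 2, 7]})
user_features = pd.DataFrame({"user": [1, 2],
                              "user_events": [12, 4]})

# Hypothetical labels: which user-item pairs ended in a purchase.
labels = pd.DataFrame({"user": [1, 2], "item": [12, 10], "bought": [1, 1]})

# Merge features and labels onto the user-item candidate pairs.
train_df = (candidates
            .merge(item_features, on="item", how="left")
            .merge(user_features, on="user", how="left")
            .merge(labels, on=["user", "item"], how="left")
            .fillna({"bought": 0}))
print(train_df)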

GBDT ranker models

This step aims to train a GBDT ranker model to select the top_N recommendations.

The GBDT ranker will take three inputs:

  • X_train, X_val: training and validation dataframes containing FEATURES
  • y_train, y_val: training and validation dataframes containing LABELS
  • group: Note that FEATURES does not contain the user and item columns [2]. Thus, the model needs the information on within which group to rank the items: group = [N_CANDIDATES] * (len(train_df) // N_CANDIDATES)

Below, you can find the sample code with XGBRanker [2]:

import xgboost as xgb

# Wrap features, labels, and group sizes; "group" tells XGBoost which
# consecutive rows belong to the same user (candidate group).
dtrain = xgb.DMatrix(X_train, y_train, group=group)

# Define model
xgb_params = {'objective': 'rank:pairwise'}

# Train
model = xgb.train(xgb_params,
                  dtrain=dtrain,
                  num_boost_round=1000)

Below, you can find the sample code with LGBMRanker [4]:

from lightgbm import LGBMRanker

# Define model with a LambdaRank objective optimizing NDCG.
ranker = LGBMRanker(objective="lambdarank",
                    metric="ndcg",
                    n_estimators=1000)

# Train; "group" lists the number of candidates per user.
model = ranker.fit(X_train,
                   y_train,
                   group=group)

The GBDT ranking model will rank the items within the specified group. To retrieve the top_N recommendations, you only have to group the output by the user and sort by the items’ scores.
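For illustration, this last step could look like the following minimal sketch, assuming the fitted LGBMRanker from above and a hypothetical test_df holding the user-item candidate pairs with their FEATURES columns:

# Score each user-item candidate pair with the trained ranker.
test_df["score"] = model.predict(test_df[FEATURES])

# Group by user, sort by score, and keep the top N items per user.
TOP_N = 20  # illustrative value
recommendations = (test_df
                   .sort_values(["user", "score"], ascending=[True, False])
                   .groupby("user")
                   .head(TOP_N))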

Summary

There are many more lessons to be learned from reviewing the learning resources Kagglers have created during the “OTTO — Multi-Objective Recommender System” competition. There are also many different solutions for this type of problem statement.

In this article, we focused on the general approach that was popular among many competitors: a two-stage approach with candidate generation via a co-visitation matrix, followed by reranking with a GBDT ranker model.

Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.

Become a Medium member to read more stories from other writers and me. You can support me by using my referral link when you sign up. I will receive a commission at no extra cost to you.

Find me on LinkedIn, Twitter, and Kaggle!

References

[1] Chris Deotte (2022). “Candidate ReRank Model — [LB 0.575]” in Kaggle Notebooks. (accessed February 26, 2023)

[2] Chris Deotte (2022). “How To Build a GBT Ranker Model” in Kaggle Discussions. (accessed February 21, 2023)

[3] Ravi Shah (2022). “Recommendation Systems for Large Datasets” in Kaggle Discussions. (accessed February 21, 2023)

[4] Radek Osmulski (2022). “[polars] Proof of concept: LGBM Ranker” in Kaggle Notebooks. (accessed February 26, 2023)

[5] Radek Osmulski (2022). “Introduction to the OTTO competition on Kaggle (RecSys)” on YouTube. (accessed February 21, 2023)

[6] Radek Osmulski (2022). “What’s the co-visitation matrix, really?” in Kaggle Discussions. (accessed February 21, 2023)

[7] Vladimir Slaykovskiy (2022). “Co-visitation Matrix” in Kaggle Notebooks. (accessed February 21, 2023)
