
How to Build Popularity-Based Recommenders with Polars


Recommender systems are algorithms designed to provide users with recommendations based on their past behavior, preferences, and interactions. Now integral to numerous industries, including e-commerce, entertainment, and advertising, recommender systems improve the user experience, increase customer retention, and drive sales.

While various advanced recommender systems exist, today I want to show you one of the most straightforward — yet often hard to beat — recommenders: the popularity-based recommender. It is an excellent baseline recommender that you should always try out alongside a more advanced model, such as matrix factorization.

We will create two different flavors of popularity-based recommenders using polars in this article. Don't worry if you have not used the fast pandas alternative polars before; this article is a great place to learn it along the way. Let's start!

Popularity-based recommenders work by suggesting the most frequently purchased products to customers. This vague idea can be turned into at least two concrete implementations:

  1. Check which articles are bought most frequently across all customers. Recommend these articles to every customer.
  2. Check which articles are bought most frequently per customer. Recommend these per-customer articles to their corresponding customer.

We will now show how to implement these concretely using our own custom-created dataset.

If you want to follow along with a real-life dataset, the H&M Personalized Fashion Recommendations challenge on Kaggle provides an excellent example. For copyright reasons, I won't use this lovely dataset in this article.

The Data

First, we will create our own dataset. Make sure to install polars if you haven't done so already:

pip install polars

Then, let us create random data consisting of (customer_id, article_id) pairs that you should interpret as "the customer with this ID bought the article with that ID." We will use 1,000,000 customers who can buy from 50,000 products.

import numpy as np
from tqdm import tqdm

np.random.seed(0)

N_CUSTOMERS = 1_000_000
N_PRODUCTS = 50_000
N_PURCHASES_MEAN = 100  # customers buy 100 articles on average

with open("transactions.csv", "w") as file:
    file.write("customer_id,article_id\n")  # header

    for customer_id in tqdm(range(N_CUSTOMERS)):
        n_purchases = np.random.poisson(lam=N_PURCHASES_MEAN)
        articles = np.random.randint(low=0, high=N_PRODUCTS, size=n_purchases)
        for article_id in articles:
            file.write(f"{customer_id},{article_id}\n")  # one transaction per row


This medium-sized dataset has over 100,000,000 rows (transactions), an amount you could easily find in a business context.
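If you want to double-check that number without loading everything into memory, polars can count the rows lazily. Here is a minimal sketch (note: in newer polars versions, pl.count() is spelled pl.len()):

import polars as pl

# Count the rows lazily, without materializing the full dataset in memory.
n_transactions = (
    pl.scan_csv("transactions.csv")
    .select(pl.count())
    .collect()
    .item()
)
print(f"{n_transactions:,} transactions")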

The Task

We now want to build recommender systems that scan this dataset in order to recommend popular items in some sense. We will clarify two variants of how to interpret this:

  • most popular across all customers
  • most popular per customer

Our recommenders should recommend ten articles for every customer.

Note: We will not assess the quality of the recommenders here. Drop me a message if you are interested in this topic, though, since it is worth a separate article.

Most Popular Articles Across All Customers

For this recommender, we don't even care who bought the articles — all the information we need is in the article_id column alone.

At a high level, it works like this:

  1. Load the data.
  2. Count how often each article appears in the article_id column.
  3. Return the ten most frequent articles as the recommendation for every customer.

Familiar Pandas Version

As a gentle start, let us take a look at how you could do this in pandas.

import pandas as pd

data = pd.read_csv("transactions.csv", usecols=["article_id"])
purchase_counts = data["article_id"].value_counts()
most_popular_articles = purchase_counts.head(10).index.tolist()

On my machine, this takes about 31 seconds. This sounds like little, but the dataset is still only moderately sized; things get really ugly with larger datasets. To be fair, 10 of those seconds are spent loading the CSV file. Using a better format, such as parquet, would decrease the loading time.
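As a rough sketch of that parquet idea — the file name is my choice, and you need pyarrow or fastparquet installed for pandas to write parquet:

import pandas as pd

# One-time conversion from CSV to parquet.
pd.read_csv("transactions.csv").to_parquet("transactions.parquet")

# Subsequent loads skip the expensive CSV parsing.
data = pd.read_parquet("transactions.parquet", columns=["article_id"])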

Note: I used pandas 2.0.1, which is the latest and most optimized version.

Still, to prepare a little more for the polars version, let us redo the pandas version using method chaining, a technique I have grown to love.

most_popular_articles = (
    pd.read_csv("transactions.csv", usecols=["article_id"])
    .squeeze()  # turn the one-column dataframe into a series
    .value_counts()
    .head(10)
    .index
    .tolist()
)

This is beautiful because you can read from top to bottom what is happening, without the need for lots of intermediate variables that people usually struggle to name (df_raw → df_filtered → df_filtered_copy → … → df_final, anyone?). The run time is the same, however.

Faster Polars Version

Let us implement the same logic in polars using method chaining as well.

import polars as pl

most_popular_articles = (
    pl.read_csv("transactions.csv", columns=["article_id"])
    .get_column("article_id")
    .value_counts()
    .sort("counts", descending=True)  # value_counts does not sort automatically
    .head(10)
    .get_column("article_id")  # there are no indices in polars
    .to_list()
)

Things look pretty similar, except for the running time: 3 seconds instead of 31, which is impressive!
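If you want to reproduce the comparison on your machine, a tiny timing helper is all you need. This is my own sketch (it assumes the pandas and polars imports from above), and the absolute numbers will of course vary with your hardware:

import time

def timed(label, fn):
    # Run fn once and report the elapsed wall-clock time.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.1f} s")
    return result

# For example, timing just the loading step of both libraries:
timed("pandas read_csv", lambda: pd.read_csv("transactions.csv", usecols=["article_id"]))
timed("polars read_csv", lambda: pl.read_csv("transactions.csv", columns=["article_id"]))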

Polars is just SO much faster than pandas.

Arguably, this is one of the main advantages of polars over pandas. Apart from that, polars also has a convenient syntax for creating complex operations that pandas doesn't have. We will see more of that when creating the other popularity-based recommender.

It's also important to note that pandas and polars produce the same output, as expected.
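You can verify this with a quick check, assuming you stored the two results under separate, hypothetical names:

# Both pipelines should yield the same ten article IDs in the same order.
assert most_popular_articles_pandas == most_popular_articles_polars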

Most Popular Articles Per Customer

In contrast to our first recommender, we now want to slice the dataframe per customer and get the most popular products for each customer. This means we need the customer_id in addition to the article_id.

We illustrate the logic using a small dataframe consisting of only ten transactions from three customers A, B, and C buying four articles 1, 2, 3, and 4. We want to get the top two articles per customer. We can achieve this using the following steps:

Illustration of the three steps. Image by the author.
  1. We start with the original dataframe.
  2. We then group by customer_id and article_id and aggregate via a count.
  3. We then aggregate again over the customer_id and write the article_ids into a list, just as in our last recommender. The twist is that we sort this list by the count column.

That way, we end up with exactly what we want; a code sketch of this toy example follows the list below.

  • A bought products 1 and 2 most often.
  • B bought products 4 and 2 most often. Products 4 and 1 would have been a correct solution as well, but internal orderings just happened to flush product 2 into the recommendation.
  • C only bought product 3, so that's all there is.
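To make the illustration concrete, here is a minimal sketch of the toy example; the exact ten transactions are my invention, chosen to reproduce the outcome described above:

import polars as pl

# Ten toy transactions from customers A, B, and C over articles 1-4.
# (These exact rows are my invention, matching the described result.)
toy = pl.DataFrame({
    "customer_id": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C"],
    "article_id": [1, 1, 2, 2, 3, 4, 4, 2, 1, 3],
})

top_two_per_customer = (
    toy
    .groupby(["customer_id", "article_id"])  # step 2: count each (customer, article) pair
    .agg(pl.count())
    .groupby("customer_id")  # step 3: collect article_ids sorted by count
    .agg(pl.col("article_id").sort_by("count", descending=True).head(2))
)
print(top_two_per_customer)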

Step 3 of this procedure sounds especially difficult, but polars lets us handle it conveniently.

most_popular_articles_per_user = (
    pl.read_csv("transactions.csv")
    .groupby(["customer_id", "article_id"])  # first arrow from the image
    .agg(pl.count())  # first arrow from the image
    .groupby("customer_id")  # second arrow
    .agg(pl.col("article_id").sort_by("count", descending=True).head(10))  # second arrow
)

By the way: this version already runs for about a minute on my machine. I didn't create a pandas version of this, and I'm definitely scared to do so and let it run. If you are brave, give it a try with the sketch below!
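For the brave, here is a rough pandas sketch of the same logic. I have not benchmarked it, so treat it as a starting point rather than a fair comparison:

import pandas as pd

most_popular_articles_per_user_pd = (
    pd.read_csv("transactions.csv")
    .groupby(["customer_id", "article_id"])
    .size()  # count purchases per (customer, article) pair
    .rename("count")
    .reset_index()
    .sort_values(["customer_id", "count"], ascending=[True, False])
    .groupby("customer_id")["article_id"]
    .agg(lambda ids: ids.head(10).tolist())  # top ten article IDs per customer
)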

A Small Improvement

So far, some users may have fewer than ten recommendations, and some may even have none. A simple fix is to pad each customer's recommendations to ten articles, for example

  • using random articles, or
  • using the most popular articles across all customers from our first popularity-based recommender.

We can implement the second option like this:

improved_recommendations = (
    most_popular_articles_per_user
    .with_columns([
        pl.col("article_id").fill_null([]).alias("personal_top_<=10"),
        pl.lit([most_popular_articles]).alias("global_top_10"),
    ])
    .with_columns(
        pl.col("personal_top_<=10")
        .arr.concat(pl.col("global_top_10"))
        .arr.head(10)
        .alias("padded_recommendations")
    )
    .select(["customer_id", "padded_recommendations"])
)
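As a quick sanity check, every customer should now have exactly ten recommendations. A sketch, assuming the .arr list namespace of the polars version used in this article (newer releases call it .list):

# Every customer should now have exactly ten recommendations.
lengths = improved_recommendations.get_column("padded_recommendations").arr.lengths()
assert lengths.min() == 10 and lengths.max() == 10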

Conclusion

Popularity-based recommenders hold a prominent position in the realm of recommendation systems due to their simplicity, ease of implementation, and effectiveness as an initial approach and a hard-to-beat baseline.

In this article, we have learned how to transform the simple idea of popularity-based recommendations into code using the fabulous polars library.

The main drawback, especially of the personalized popularity-based recommender, is that the recommendations are not inspiring in any way. People have seen all of the recommended things before, meaning they are stuck in an extreme echo chamber.

One way to mitigate this problem to some extent is by using other approaches, such as collaborative filtering or hybrid approaches.

I hope that you learned something new, interesting, and useful today. Thanks for reading!

As a final point, if you

  1. want to support me in writing more about machine learning and
  2. plan to get a Medium subscription anyway,

why not do it via this link? This would help me a lot! 😊

To be transparent, the price for you does not change, but about half of the subscription fees go directly to me.

Thanks a lot if you consider supporting me!

If you have any questions, write to me on LinkedIn!
