How to Build Popularity-Based Recommenders with Polars
Initial Thoughts
Most Popular Across All Customers
Most Popular Per Customer
Conclusion

Created by me on dreamstudio.ai.

Recommender systems are algorithms designed to provide users with recommendations based on their past behavior, preferences, and interactions. Integral to numerous industries, including e-commerce, entertainment, and advertising, recommender systems improve the user experience, increase customer retention, and drive sales.

While various advanced recommender systems exist, today I want to show you one of the simplest — yet often difficult to beat — recommenders: the popularity-based recommender. It is an excellent baseline recommender that you should always try out in addition to a more advanced model, such as matrix factorization.

We’ll create two different flavors of popularity-based recommenders using polars in this article. Don’t worry if you have not used the fast pandas alternative polars before; this article is a great place to learn it along the way. Let’s start!

Popularity-based recommenders work by suggesting the most frequently purchased products to customers. This vague idea can be turned into at least two concrete implementations:

  1. Check which articles are bought most often across all customers. Recommend these articles to every customer.
  2. Check which articles are bought most often per customer. Recommend these per-customer articles to their corresponding customer.

We’ll now show how to implement both of these concretely using our own custom-created dataset.

If you want to follow along with a real-life dataset, the H&M Personalized Fashion Recommendations challenge on Kaggle provides an excellent example. For copyright reasons, I will not use this lovely dataset for this article.

The Data

First, we’ll create our own dataset. Make sure to install polars, as well as tqdm for the progress bar used below, if you haven’t done so already:

pip install polars tqdm

Then, let us create random data consisting of (customer_id, article_id) pairs that you should interpret as “The customer with this ID bought the article with that ID.”. We’ll use 1,000,000 customers that can buy from 50,000 products.

import numpy as np
from tqdm import tqdm  # progress bar

np.random.seed(0)

N_CUSTOMERS = 1_000_000
N_PRODUCTS = 50_000
N_PURCHASES_MEAN = 100  # customers buy 100 articles on average

with open("transactions.csv", "w") as file:
    file.write("customer_id,article_id\n")  # header

    for customer_id in tqdm(range(N_CUSTOMERS)):
        n_purchases = np.random.poisson(lam=N_PURCHASES_MEAN)
        articles = np.random.randint(low=0, high=N_PRODUCTS, size=n_purchases)
        for article_id in articles:
            file.write(f"{customer_id},{article_id}\n")  # transaction as a row

Image by the author.

This medium-sized dataset has about 100,000,000 rows (transactions), an amount you could find in a business context.

The Task

We now want to build recommender systems that scan this dataset in order to recommend popular items in some sense. We will clarify two variants of how to interpret this:

  • most popular across all customers
  • most popular per customer

Our recommenders should recommend ten articles to each customer.

We will not assess the quality of the recommenders here. Drop me a message if you are interested in this topic, though, since it’s worth having a separate article about it.

Most Popular Across All Customers

For this recommender, we don’t even care who bought the articles — all the information we need is in the article_id column alone.

At a high level, it works like this:

  1. Load the data.
  2. Count how often each article appears in the article_id column.
  3. Return the ten most frequent products as the recommendation for every customer.
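The three steps above can be sketched library-agnostically before we touch pandas or polars. Here is a minimal pure-Python version using `collections.Counter` on a hypothetical, made-up purchase log (the article IDs below are not from the generated dataset):

```python
from collections import Counter

# Hypothetical purchase log: one article_id per transaction (step 1).
purchases = [3, 1, 3, 2, 3, 1, 2, 3, 4, 1]

# Count how often each article appears (step 2).
purchase_counts = Counter(purchases)

# The k most frequent articles are the recommendation for every customer (step 3).
top_articles = [article for article, _ in purchase_counts.most_common(2)]
print(top_articles)  # → [3, 1]
```

The pandas and polars versions below follow exactly this pattern, just on 100 million rows instead of ten.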

Familiar Pandas Version

As a gentle start, let us look at how you could do this using pandas.

import pandas as pd

data = pd.read_csv("transactions.csv", usecols=["article_id"])
purchase_counts = data["article_id"].value_counts()
most_popular_articles = purchase_counts.head(10).index.tolist()

On my machine, this takes about 31 seconds. This sounds like only a little, but the dataset still has only 100 million rows; things get really ugly with larger datasets. To be fair, about 10 of those seconds are spent loading the CSV file. Using a better format, such as Parquet, would decrease the loading time.

I used pandas 2.0.1, the latest and most optimized version at the time of writing.

Still, to prepare a little more for the polars version, let us redo the pandas version using method chaining, a technique I grew to love.

most_popular_articles = (
    pd.read_csv("transactions.csv", usecols=["article_id"])
    .squeeze()  # turn the dataframe with one column into a series
    .value_counts()
    .head(10)
    .index
    .tolist()
)

This is beautiful since you can read from top to bottom what is happening, without the need for many intermediate variables that people usually struggle to name (df_raw → df_filtered → df_filtered_copy → … → df_final, anyone?). The run time is the same, however.

Faster Polars Version

Let us implement the same logic in polars using method chaining as well.

import polars as pl

most_popular_articles = (
    pl.read_csv("transactions.csv", columns=["article_id"])
    .get_column("article_id")
    .value_counts()
    .sort("counts", descending=True)  # value_counts does not sort automatically
    .head(10)
    .get_column("article_id")  # there are no indices in polars
    .to_list()
)

Things look pretty similar, apart from the running time: the polars version takes only a fraction of the pandas version’s 31 seconds, which is impressive!

Polars is just SO much faster than pandas.

Unarguably, this is one of the main advantages of polars over pandas. Apart from that, polars also has an expressive syntax for nested data, such as lists, that pandas doesn’t have. We’ll see more of that when creating the other popularity-based recommender.

It is also important to note that pandas and polars produce the same output, as expected.

Most Popular Per Customer

In contrast to our first recommender, we want to slice the dataframe per customer now and get the most popular products for each customer. This means that we need the customer_id in addition to the article_id now.

We illustrate the logic using a small dataframe consisting of only ten transactions from three customers A, B, and C buying four articles 1, 2, 3, and 4. We want to get the two most frequently bought articles per customer. We can achieve this using the following steps:

Image by the author.
  1. We start with the original dataframe.
  2. We then group by customer_id and article_id and aggregate via a count.
  3. We then aggregate again over the customer_id and write the article_ids in a list, just as in our last recommender. The twist is that we sort this list by the counts in descending order first.

That way, we end up with exactly what we want.

  • A bought products 1 and 2 most frequently.
  • B bought products 4 and 2 most frequently. Products 4 and 1 would have been a correct solution as well, but internal orderings just happened to flush product 2 into the recommendation.
  • C only bought product 3, so that’s all there is.
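The two-step grouping can also be sketched in plain Python. The transactions below are hypothetical and chosen without ties, so the result is deterministic; the two comments mirror the two arrows from the image:

```python
from collections import Counter, defaultdict

# Hypothetical (customer, article) transactions, chosen so there are no ties.
transactions = [
    ("A", 1), ("A", 1), ("A", 1), ("A", 2), ("A", 2), ("A", 3),
    ("B", 4), ("B", 4), ("B", 2),
    ("C", 3),
]

# Group by customer and article and aggregate via a count (first arrow).
counts = defaultdict(Counter)
for customer, article in transactions:
    counts[customer][article] += 1

# Per customer, sort articles by count descending and keep the top two (second arrow).
top_per_customer = {
    customer: [article for article, _ in counter.most_common(2)]
    for customer, counter in counts.items()
}
print(top_per_customer)  # → {'A': [1, 2], 'B': [4, 2], 'C': [3]}
```

The polars version below does the same thing with two groupbys instead of nested counters.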

Step 3 of this procedure sounds especially difficult, but polars lets us handle it conveniently.

most_popular_articles_per_user = (
    pl.read_csv("transactions.csv")
    .groupby(["customer_id", "article_id"])  # first arrow from the image
    .agg(pl.count())  # first arrow from the image
    .groupby("customer_id")  # second arrow
    .agg(pl.col("article_id").sort_by("count", descending=True).head(10))  # second arrow
)

This version also runs quickly on my machine. I didn’t create a pandas version for this, and I’m definitely scared to do so and let it run. If you are brave, give it a try!

A Small Improvement

So far, some users might have fewer than ten recommendations, and some might even have none. An easy thing to do is pad each customer’s recommendations to ten articles. For example,

  • using random articles, or
  • using the most popular articles across all customers from our first popularity-based recommender.
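The second padding strategy boils down to appending the global top list to the personal one, dropping duplicates while preserving order, and cutting off after ten items. A minimal pure-Python sketch (the function name `pad_with_global_top` and the example IDs are made up for illustration):

```python
def pad_with_global_top(personal, global_top, k=10):
    """Pad a personal recommendation list to k items with global bestsellers."""
    padded = personal + [a for a in global_top if a not in personal]
    return padded[:k]

# Hypothetical global top ten from the first recommender.
global_top_10 = [7, 3, 9, 1, 5, 2, 8, 4, 6, 0]

# A customer with only two personal recommendations gets padded to ten;
# note that article 3 is not repeated.
print(pad_with_global_top([3, 42], global_top_10))
# → [3, 42, 7, 9, 1, 5, 2, 8, 4, 6]
```

The polars version below expresses the same concat-then-truncate idea with list columns.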

We can implement the second version like this:

improved_recommendations = (
    most_popular_articles_per_user
    .with_columns([
        pl.col("article_id").fill_null([]).alias("personal_top_<=10"),
        pl.lit([most_popular_articles]).alias("global_top_10")
    ])
    .with_columns(
        pl.col("personal_top_<=10").arr.concat(pl.col("global_top_10")).arr.head(10).alias("padded_recommendations")
    )
    .select(["customer_id", "padded_recommendations"])
)

Conclusion

Popularity-based recommenders hold a significant position in the realm of recommendation systems due to their simplicity, ease of implementation, and effectiveness as an initial approach and a difficult-to-beat baseline.

In this article, we have learned how to transform the simple idea of popularity-based recommendations into code using the fabulous polars library.

The main drawback, especially of the personalized popularity-based recommender, is that the recommendations are not novel in any way. People have seen all of the recommended items before, meaning they’re stuck in an extreme echo chamber.

One way to mitigate this problem to some extent is by using other approaches, such as collaborative filtering, or hybrid approaches.

I hope that you learned something new, interesting, and valuable today. Thanks for reading!


If you have any questions, write me on LinkedIn!
