Scaling Recommender Transformers to a Billion Parameters


My name is Kirill Khrylchenko, and I lead the RecSys R&D team at Yandex. One of our goals is to develop transformer technologies for recommender systems, an objective we’ve been pursuing for five years now. Not long ago, we reached a new milestone in the development of recommendation technologies, which I would like to share with you in this article.

The relevance of recommender systems is easy to justify: the amount of content in the world is growing incredibly fast, making it impossible to view in its entirety, and we need recommender systems to handle the information overload. Music, movies, books, products, videos, posts, friends. It’s also important to remember that these services benefit not only users but also content creators who need to find their target audience.

We’ve deployed a new generation of transformer recommenders in several services and are actively rolling them out to others. We’ve significantly improved the quality of recommendations across the board.

If you’re an ML engineer working with recommendations, this article will offer you some ideas on how to implement a similar approach in your recommender system. And if you’re a user, you’ll have a chance to learn more about how such a recommender system works.

How Recommender Systems Work

The recommendation problem itself has a simple mathematical definition: for each user, we want to select the items (objects, documents, or products) that they are likely to like.

But there’s a catch:

  • Item catalogs are vast (up to billions of items).
  • There’s a huge number of users, and their interests are always shifting.
  • Interactions between users and items are very sparse.
  • It’s unclear how to define actual user preferences.

To tackle the recommendation problem effectively, we need to leverage non-trivial machine learning models.

Neural networks are a potent machine learning tool, especially when there’s a large amount of unstructured data, such as text or images. While classical machine learning involves expert domain knowledge and considerable manual work (feature engineering), neural networks can extract complex relationships and patterns from raw data almost automatically.

In the RecSys domain, we have a large amount of mostly unstructured data (literally trillions of anonymized user-item interactions), as well as content-based entities (items consist of titles, descriptions, images, videos, and audio; users can be represented as sequences of events). Moreover, it’s crucial that the recommender system performs well for brand-new items and cold users, and encoding users and items through content helps achieve this.

The time we have to generate recommendations for the user is very strictly limited. Every millisecond counts! Moreover, we don’t have infinite hardware resources, and the catalogs we need recommendations from are quite large. This is why recommendations are usually formed in multiple stages:

  • First, we select a relatively small set of candidates from the entire catalog using lightweight models (retrieval stage).
  • Then, we run these candidates through more complex models that use additional information and more intensive computations per candidate (ranking stage).

Architecturally, models vary significantly between stages, making it difficult to discuss any aspect without referring to a specific stage of the recommender system.

Multi-stage recommender systems. Image by Author

The two-tower neural network architecture is extremely popular for the retrieval stage. Users and items (in information retrieval, these would be queries and documents) are independently encoded into vector representations, and the dot product is used to calculate the similarity between them.

You could also say that such models “embed” users and items into a shared “semantic space”: the closer a user-item pair is in the vector space, the more similar they are.

Two-tower models are very fast. Let’s assume the user requests recommendations. The two-tower model then needs to calculate:

  • The “user tower”: once per request.
  • The vectors of all candidate items for which we want to calculate user-item affinity.
  • The dot products.

You don’t even have to recalculate the candidate item vectors for every user query, because they’re the same for all users and rarely change; for instance, we don’t expect a movie or a music track to change its title often. In practice, we usually recalculate item vectors for the entire catalog offline (for instance, daily) and upload them either to the service where we calculate the dot product or to another service that we access online to retrieve the required item vectors.
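To make the factorization concrete, here’s a minimal sketch of what the runtime actually computes, with random vectors standing in for the tower outputs (NumPy; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128
# Item vectors are recomputed offline (e.g., daily) and shipped to the service.
item_vecs = rng.standard_normal((1_000_000, dim), dtype=np.float32)
# The user tower runs once per request and produces one vector.
user_vec = rng.standard_normal(dim, dtype=np.float32)

scores = item_vecs @ user_vec        # one dot product per candidate item
top_n = np.argsort(-scores)[:100]    # indices of the 100 highest-affinity items
```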

But that’s me describing a use case where you have some reasonably small number of candidates to calculate user-item affinities for. This is true for the ranking stage. However, at the candidate generation stage, the problem becomes more complicated: we need to calculate proximities for all items in the catalog, select the top N (where N is usually in the hundreds to thousands) with the highest affinity values, and then forward them to the next stages.

This is where two-tower models are invaluable: using approximate search methods, we can quickly generate an approximate top N by dot product, even for enormous catalogs. We build a special “index” (typically a graph structure, as in the HNSW method) over the already calculated item vectors; the index is stored in the service, and we feed user vectors into it to extract an approximate top for each one.

Building this index is difficult and time-consuming (with a separate challenge of quickly updating and rebuilding it). That said, it can still be done offline, after which the binary and the index can be uploaded to the service, where we’ll search for candidates at runtime.
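For illustration, this is roughly what building and querying such an index looks like with the hnswlib library (parameters are placeholders; in production the index is built offline and only queried at runtime):

```python
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
num_items, dim = 1_000_000, 128
item_vecs = rng.standard_normal((num_items, dim), dtype=np.float32)

# Built offline: the "ip" space makes hnswlib rank neighbors by dot product.
index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=num_items, ef_construction=200, M=16)
index.add_items(item_vecs, np.arange(num_items))

# At runtime: approximate top-N for one user vector.
index.set_ef(100)  # query-time accuracy/speed trade-off
user_vec = rng.standard_normal(dim, dtype=np.float32)
labels, distances = index.knn_query(user_vec, k=500)
```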

Two-tower neural network. Image by Author

How Do We Encode a User Into a Vector?

Classical algorithms solved this problem quite simply: in matrix factorization methods (like ALS), the user vector was “trainable”, represented by model parameters and determined within the optimization procedure. In user-item collaborative filtering methods, a user was assigned a vector of catalog dimensionality in which the i-th coordinate corresponded to a particular item and represented how often the user interacted with that item (e.g., how often they bought it or how they rated it).

The modern approach is to encode users with transformers. We take the user’s anonymized history, that is, a sequence of events, encode these events into vectors, and then apply a transformer. In the most basic case, events are purchases or likes; in other cases, it may be the entire history of interactions within a company’s ecosystem.

Initially, when transformers were first used in recommendations, researchers drew analogies with NLP: a user is like a sentence, and the words in it are purchases, likes, and other interactions.
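A minimal sketch of this idea (PyTorch; the sizes, the lack of positional encodings, and the use of the last hidden state are all simplifications):

```python
import torch
import torch.nn as nn

class UserTower(nn.Module):
    """Encode a user's event sequence into one vector with a small transformer."""
    def __init__(self, num_items: int, dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, event_ids: torch.Tensor) -> torch.Tensor:  # (batch, seq_len)
        hidden = self.encoder(self.item_emb(event_ids))           # (batch, seq_len, dim)
        return hidden[:, -1]                                      # last state as the user vector

user_vecs = UserTower(num_items=100_000)(torch.randint(0, 100_000, (8, 50)))
```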

Two-tower neural network design with a transformer. Image by Author

Another type of neural network recommender model is the model with early fusion. These models don’t separate user and item information into two towers but rather process all the information together. That is, we fuse all the information about the user, the item, and their interaction at an early stage. In contrast, two-tower models are said to feature late fusion through the dot product. Early-fusion models are more expressive than two-tower models: they can capture more complex signals and learn more non-trivial dependencies.

However, it’s difficult to apply them outside the ranking stage because of their computational burden and the need to recalculate the entire model for every user query and every candidate. Unlike two-tower models, they don’t support the factorization of computations.

We use various architecture types, including two-tower models with transformers and models with early fusion. We use two-tower architectures more often because they’re highly efficient, suitable for all stages at once, and still yield good quality gains with considerably fewer resources.

We used to train two-tower models in two stages:

  • Pre-training with contrastive learning. We train the model to align users with their positive user-item interactions.
  • Task-specific fine-tuning. As in NLP, fine-tuning is task-specific. If the model will be used for ranking, we train it to correctly rank the recommendations shown to the user: if we showed two items and the user liked one and disliked the other, we want to rank them in that order. For retrieval, the task resembles pre-training but employs additional techniques that improve candidate recall.

In the next section, we’ll explore how this process has changed with our newer models.

Scaling Recommender Systems

Is there a limit to the size of recommender models, beyond which we no longer see size-related improvements in the quality of recommendations?

For a long time, our recommender models (and not only ours, but models across industry and academia) were very small, which suggested that the answer to this question was “yes”.

However, in deep learning there’s the scaling hypothesis, which states that as models become larger and the amount of data increases, model quality should improve significantly.

Much of the progress in deep learning over the past decade can be attributed to this hypothesis. Even the earliest successes in deep learning were based on scaling, with the emergence of an extensive dataset for image classification, ImageNet, and the impressive performance of neural networks (AlexNet) on that dataset.

The scaling hypothesis is even more evident in language models and natural language processing (NLP): you can predict how quality improves with the amount of computation and express the corresponding scaling laws.

Dashboard parameter overview. Image by Author

What do I mean when I say recommender models can be made bigger?

There are as many as four different axes to scale.

Embeddings. We have a wide range of information about users and items, so we have access to a wide selection of features, and a large portion of those features are categorical. Examples of categorical features include item ID, artist ID, genre, and language.

Categorical features have a very high cardinality (number of unique values), reaching billions, so if you create large trainable embeddings (vector representations) for them, you get huge embedding matrices.

That said, embeddings are the bottleneck between the input data and the model, so you need to make them large for good quality. For example, Meta* has embedding matrices with sizes ranging from 675 billion to 13 trillion parameters, while Google reported at least 1 billion parameters in YouTubeDNN back in 2016. Even Pinterest, which had long promoted inductive graph embeddings from PinSage [1, 2], has recently started using large embedding matrices.

Context length. For a long time, recommender system engineers have been busy generating features. In modern ranking systems, the number of features can reach hundreds or even thousands, and Yandex services are no exception.

Another example of “context” in a model is the user’s history in a transformer. Here, the size of the context is determined by the length of the history. In both industry and academia, this number tends to be very small, with only a few hundred events at best.

Training dataset size. I already mentioned that we have lots of data. Recommender systems produce many datasets comparable in size to the GPT-3 training dataset.

The industry has shown multiple examples of large datasets with billions of training examples: 2 billion, 2.1 billion, 3 billion, 60 billion, 100 billion, 146 billion, 500 billion.

Encoder size. The standard for early-fusion models would be millions or tens of millions of parameters. According to the Google papers, “simplified” versions of their Wide&Deep models had 1 to 68 million parameters in the experiments [1, 2]. And if we use a two-layer DCN-v2 (a popular neural network layer for early-fusion models) over a thousand continuous features, we get no more than 10 million parameters.

Two-tower models most often use tiny transformers to encode the user: for example, two transformer blocks with a hidden layer dimensionality not exceeding a couple of hundred. Such a configuration has at most a few million parameters.

And while the sizes of embedding matrices and training datasets are already quite large, scaling the length of the user history and the capacity of the encoder part of the model remains an open question. Is there any significant scaling along these axes or not?

This was the question on our minds in February 2024. Then a paper from researchers at Meta, titled “Actions Speak Louder than Words”, cheered us all up a bit.

The authors presented a new encoder architecture called HSTU and formulated both the ranking problem and the candidate generation problem as generative modeling. The model had a very long history length (8,000 events!) together with an extensive training dataset (100 billion examples), and the user history encoder was much larger than the previous few million parameters. However, even here, the largest encoder configuration mentioned has only 176 million parameters, and it’s unclear whether they deployed it (judging by subsequent papers, they didn’t).

Are 176 million parameters in an encoder a lot or a little? If we look at language models, the answer is clear: an LLM with 176 million parameters in the encoder would be highly inferior in capacity and problem-solving quality to modern SOTA models with billions or even trillions of parameters.

Why, then, do we have such small models in recommender systems?

Why can’t we achieve a similar leap in quality if we replace natural-language texts with anonymized user histories in which actions act as words? Have recommender models already reached the ceiling of their baseline quality, leaving us only small incremental improvements, tweaking features and target values?

These were the existential questions we asked ourselves when designing our own new ARGUS approach.

RecSys × LLM × RL

After plowing through the extensive literature on scaling, we found that three major conditions determine the success of neural network scaling:

  • Lots of data.
  • A very expressive architecture with large model capacity.
  • The most general, fundamental learning task possible.

For example, LLMs are very expressive and powerful transformers that learn from practically all the text on the internet. Moreover, predicting the next word is a fundamental task that, in reality, decomposes into various tasks across different fields, including grammar, erudition, mathematics, physics, and programming. All three conditions are met!

If we take a look at recommender systems:

  • We also have lots of data: trillions of interactions between users and items.
  • We can just as easily use transformers.
  • We just need to find the right learning task to scale the recommender model.

That’s what we did.

LLM process flow. Image by Author

There’s an interesting aspect of pre-training large language models. If you just ask a pre-trained LLM about something, it will give an average answer: the most likely answer it has encountered in the training data. That answer won’t necessarily be good or correct.

But if you add a prompt before the question, like “Imagine you are an expert in X”, it will start providing much more relevant and correct answers.

That’s because LLMs don’t just learn to imitate answers from the internet; they also acquire a more fundamental understanding of the world in an attempt to compress all the knowledge in the training set. An LLM learns patterns and abstractions. And it’s precisely because the LLM both knows a wide range of answers and possesses a fundamental understanding of the world that we can get good answers from it.

Venn Diagram: What Makes for a Good Answer? Image by Author

We tried to apply this logic to recommender systems. First, you need to express recommendations as a reinforcement learning task:

  • A recommender system is an agent.
  • Actions are recommendations. In the most basic case, the recommender system recommends one item at a time (for example, one music track in the music streaming app each time).
  • The environment is the users: their behaviors, patterns, preferences, and interests.
  • The policy is a probability distribution over items.
  • The reward is the user’s positive feedback in response to a recommendation.

Recommendations as a Reinforcement Learning Task. Image by Author

There’s a direct analogy to the LLM example. “Answers from the internet” are the actions of past recommender systems (logging policies), and fundamental knowledge about the world is understanding users, their patterns, and preferences. We want our new model to be able to:

  • Imitate the actions of past recommender systems.
  • Have a good understanding of users.
  • Adjust its actions to achieve a better outcome.

Before we move on to our new approach, let’s examine the most popular setup for training recommendation transformers: next-item prediction. The SASRec model is very representative here. The system accumulates a user’s history of positive interactions with the service (for example, purchases), and the model learns to predict which purchase is likely to come next in the sequence. That is, instead of next-token prediction, as in NLP, we do next-item prediction.

Self-Attentive Sequential Recommendation (SASRec). Source

This approach (SASRec and common next-item prediction) isn’t consistent with the philosophy I described earlier, which focuses on adjusting the logging policy based on fundamental knowledge of the world. It would seem that to predict what the user will buy next, the model should operate under this philosophy:

  • It should understand what would be shown to the user by the recommender system that was in production at the time for which the prediction is made. That is, it must have a good model of the logging policy’s behavior (i.e., a model that can be used for imitation).
  • It needs to understand what the user might have liked among the things shown by the past recommender system, meaning it needs to understand their preferences, which are the fundamental beliefs about the world.

But models like SASRec don’t explicitly model either of these things. They lack complete information about past logging policies (we only see recommendations with positive outcomes), and we don’t learn to replicate those logging policies. There’s no way to know what the past recommender system could have offered. At the same time, we don’t fully model the world or the user: we ignore all negative feedback and only consider positive feedback.

ARGUS: AutoRegressive Generative User Sequential Modeling

AutoRegressive Generative User Sequential modeling (ARGUS) is our new approach to training recommendation transformers.

First, we examine the entire anonymized user history: not only positive interactions but all other interactions as well. We capture the essence of the interaction context: the time it occurred, the device used, the product page the user was on, their My Vibe personalization settings, and other relevant details.

ARGUS: AutoRegressive Generative User Sequential Modeling

User history is a sequence of triples (context, item, feedback), where context refers to the interaction context, item is the item the user interacts with, and feedback is the user’s response to the interaction (such as whether the user liked the item, bought it, etc.).

Next, we define two new learning tasks, each of which extends beyond the standard next-item prediction widely used in industry and academia.

Next item prediction

Our first task is also called next-item prediction. Looking at the history and the current interaction context, we predict which item will be interacted with: P(item | history, context).

Next Item Prediction Flow. Image by Author

  • If the history contains only recommendation traffic (events generated directly by the recommender system), the model learns to imitate the logging policy (the recommendations of the past recommender system).
  • If there’s also organic traffic (any traffic other than recommendations, such as the user coming from search or visiting their library to listen to a favorite track), we also gain more fundamental knowledge about the user, unrelated to the logging policy.

Important: though this task has the same name as in SASRec (next-item prediction), it’s not the same task at all. We predict not only positive but also negative interactions, and we also take the current context into account. The context helps us understand whether the action is organic or not, and if it’s a recommendation, which surface it’s on (place, page, or carousel). It also generally reduces the noise level during model training.

Context is important for music recommendations: the user’s mood and their current situation have a big impact on the type of music they want to listen to.

The task of predicting an element from a set is typically expressed as a classification problem, where the elements of the original set serve as classes. We then use a cross-entropy loss for training, with the softmax function applied to the logits (the unnormalized outputs of the neural network). Computing the softmax requires summing exponentials of the logits across all classes.

While vocabulary sizes in LLMs reach hundreds of thousands of items at worst, and softmax calculation isn’t a significant problem there, it becomes a concern in recommender systems. Here, catalogs consist of millions or even billions of items, and calculating the full softmax is impossible. This is a topic for a separate big article, but ultimately we have to use a trickier loss function, “sampled softmax” with a logQ correction:

  • N is a mixture of in-batch and uniform negatives.
  • logQ(n) is the logQ correction.
  • The temperature T is a trained parameter, clipped to [0.01, 100].
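Here’s a minimal sketch of what such a loss can look like (PyTorch; the tensor shapes and exact arrangement are illustrative, not our production code):

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user, pos, neg, log_q_pos, log_q_neg, log_temp):
    """Sampled softmax with logQ correction.

    user, pos: (B, d); neg: (M, d), a mix of in-batch and uniform negatives;
    log_q_*: log sampling probabilities of each item; log_temp: trained scalar.
    """
    temp = log_temp.exp().clamp(0.01, 100.0)               # temperature clipped to [0.01, 100]
    pos_logits = (user * pos).sum(-1) / temp - log_q_pos   # (B,)
    neg_logits = user @ neg.T / temp - log_q_neg           # (B, M), logQ(n) subtracted
    logits = torch.cat([pos_logits[:, None], neg_logits], dim=1)
    labels = torch.zeros(user.size(0), dtype=torch.long)   # the positive sits in column 0
    return F.cross_entropy(logits, labels)
```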

Feedback prediction

Feedback prediction is the second learning task. Given the history, the current context, and the item, we predict the user’s feedback: P(feedback | history, context, item).

Feedback prediction learning task. Image by Author

The first task, next-item prediction, teaches the model to imitate logging policies (and to understand users, if there’s organic traffic). The feedback prediction task, in contrast, is focused exclusively on acquiring fundamental knowledge about users, their preferences, and interests.

It is very similar to how the ranking variant of the model from “Actions Speak Louder than Words” learns on a sequence of (item, action) pairs. Still, here the context token is treated separately, and there are more than just recommendation contexts present.

Feedback can have multiple components: whether a track was liked, disliked, or added to a playlist, and what portion of the track was listened to. We predict all types of feedback by decomposing them into individual loss functions. Any suitable loss can serve as a component, including cross-entropy or regression losses. For example, binary cross-entropy is sufficient to predict whether a like occurred.
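A minimal sketch of this decomposition (PyTorch; the head modules and the choice of signals are hypothetical):

```python
import torch
import torch.nn.functional as F

def feedback_loss(hidden, like_label, listen_frac, like_head, listen_head):
    """Sum per-signal losses over an item's hidden state.

    hidden: (B, d) transformer states at item positions; like_head and
    listen_head are small MLPs that each produce one output per example.
    """
    like_logit = like_head(hidden).squeeze(-1)          # was the track liked?
    like_loss = F.binary_cross_entropy_with_logits(like_logit, like_label)
    listen_pred = listen_head(hidden).squeeze(-1)       # portion of the track played
    listen_loss = F.mse_loss(listen_pred, listen_frac)  # regression component
    return like_loss + listen_loss
```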

Although some types of feedback are much rarer than others (there are usually far fewer likes than long listens), the model does a good job of learning to predict all the signals. The larger the model, the easier it is to learn all tasks at once, without conflicts. Moreover, frequent feedback (listens) actually helps the model learn to model rare, sparse feedback (likes).

Diagram illustrating how the transformer model performs next-item and feedback prediction. Image by Author

If we combine all this into a single learning task, we get the following:

  • Build user histories from triples (context, item, feedback).
  • Run the transformer over them.
  • Predict the next item from the hidden state of the context.
  • Predict the user’s feedback on the item from the hidden state of the item.

The image illustrates the difference between the ARGUS and SASRec approaches: with ARGUS, we train the model to imitate the behavior of past recommender systems and to predict the user’s response; with SASRec, we train the model to predict the next positive interaction.
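For concreteness, here’s a minimal sketch of how such a triple sequence can be interleaved before the transformer pass (PyTorch; names and shapes are hypothetical):

```python
import torch

def interleave_triples(ctx, item, fb):
    """Stack per-event (context, item, feedback) embeddings into one token sequence.

    ctx, item, fb: (batch, n_events, dim) -> (batch, 3 * n_events, dim).
    After a causal transformer pass, hidden states at context positions feed the
    next-item head, and hidden states at item positions feed the feedback head.
    """
    b, n, d = ctx.shape
    tokens = torch.stack([ctx, item, fb], dim=2)  # (b, n, 3, d)
    return tokens.reshape(b, 3 * n, d)            # ..., c_t, i_t, f_t, c_{t+1}, ...
```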

Let me also comment on how this differs from HSTU. In “Actions Speak Louder than Words”, the authors train two separate models for candidate generation and ranking. The candidate generation model consumes the entire history but, like SASRec, models only positive interactions and doesn’t compute the loss where the interaction is negative. The ranking model, as mentioned earlier, learns a task similar to our feedback prediction.

Our solution offers a more comprehensive next-item prediction task and a more comprehensive feedback prediction task, and the model learns both at once.

Simplified ARGUS

Our approach has one big problem: it inflates the user’s history. Because each interaction with an item is represented by three tokens at once (context, item, feedback), we’d have to feed almost 25,000 tokens into the transformer to analyze 8,192 recent user listens.

Simplified ARGUS. Image by Author

One could argue that this is still not much and that context lengths are far longer in LLMs; however, this isn’t entirely accurate. LLMs, on average, deal with much smaller numbers, typically hundreds of tokens, especially during pre-training.

In contrast, on our music streaming platform, users often have thousands or even tens of thousands of events. We already have much longer context lengths, and inflating them by a factor of three has an even bigger impact on training speed. To tackle this, we created a simplified version of the model, in which each triple (context, item, feedback) is condensed into a single vector. In terms of input format, it resembles our previous generations of transformer models; however, we keep the same two learning tasks: next-item prediction and feedback prediction.

To predict the next item, we take the hidden state of the transformer corresponding to the triple (c, i, f) at a past point in time, concatenate the current context vector to it, compress the result to a lower dimension with an MLP, and then use sampled softmax to learn to predict the next item.

To predict the feedback, we instead concatenate the vector of the current item and use an MLP to predict all the required target variables. In terms of recommender transformer architectures, our model becomes less target-aware and less context-aware; however, it still performs well and runs three times faster.
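A sketch of the two heads in the simplified model (PyTorch; dimensions and MLP shapes are illustrative):

```python
import torch
import torch.nn as nn

class SimplifiedArgusHeads(nn.Module):
    """Two heads over single-vector events: a next-item query and feedback logits."""
    def __init__(self, dim: int, item_dim: int, num_signals: int = 2):
        super().__init__()
        self.nip_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, item_dim))
        self.fp_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_signals))

    def next_item_query(self, prev_hidden, cur_context):
        # hidden state of the previous (c, i, f) event + current context vector
        return self.nip_mlp(torch.cat([prev_hidden, cur_context], dim=-1))

    def feedback_logits(self, prev_hidden, cur_item):
        # hidden state + current item vector -> one logit per feedback signal
        return self.fp_mlp(torch.cat([prev_hidden, cur_item], dim=-1))
```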

ARGUS Deployment

A model trained in this two-headed mode for both tasks at once (next-item prediction and feedback prediction) can be deployed as is: the NIP head handles candidate selection, and the FP head handles final ranking.

But we didn’t want to do that, at least not for our first deployment:

  • Our goal was to deploy a very large model, so we initially focused on offline deployment. With offline deployment, user and item vectors are recalculated daily in a separate regular process, and only the dot product is calculated in the runtime environment.
  • The pre-trained version of ARGUS assumes access to the user’s history without any delay: when the prediction is made, it sees all events in the history up to the current point in time. That is, it would have to be applied at runtime.
  • The NIP head predicts all user interactions, whereas a candidate generation model is normally trained to predict only future positive interactions. But predicting positive interactions is a heuristic, a surrogate learning task. It might even be better to use a head that predicts all interactions, because it learns to be consistent with the ranking: if an item was recommended, the ranker liked it. In this case, though, we weren’t ready to experiment with that and instead wanted to follow the well-trodden path.
  • The FP head learns from pointwise losses: whether a track will be liked or not, what portion of the track will be heard, and so on. But we still often train models for pairwise ranking: we learn to rank items that were recommended “next to each other” and received different feedback. Some argue that pointwise losses are sufficient for training ranking models, but in this case, we aren’t replacing the entire ranking stack. Instead, we aim to add a new, powerful neural-network-based feature to the final ranking model. If the final ranking model is trained for a particular task (such as pairwise ranking), then the neural network that generates the feature is most efficiently trained for that same task; otherwise, the final model will rely less on our feature. Accordingly, we’d prefer to train ARGUS for the same task as the original ranking model, allowing us to use it in ranking.

There are other deployment use cases beyond the standard candidate generation and ranking stages, and we’re actively researching these as well. However, for our first deployment, we went with offline two-tower ranking:

  • We decided to fine-tune ARGUS so that it could be used as an offline two-tower model. We recalculate user and item vectors daily, and user preferences are determined through the dot product between the user and the items.
  • We fine-tuned ARGUS for a pairwise ranking task similar to the one the final ranking model is trained on. That means we select pairs of tracks that the user heard and rated differently in terms of positive feedback, and we learn to rank them correctly.

We build these models very often: they’re easy to train and deploy in terms of resources and development costs. However, our previous models were significantly smaller and learned differently: not with the ARGUS procedure, but first with the standard contrastive learning between users and positive items, and then fine-tuned for the task.

Our previous contrastive pre-training procedure involved compiling multiple training examples per user: if the user had n purchases, there would be n samples in the dataset. We didn’t use autoregressive learning; that is, we ran the transformer n times during training. This approach let us be very flexible in creating (user, item) pairs for training, use any history format, encode context together with the user, and account for lags: when predicting likes, we can use a one-day lag in the user’s history. However, things ran pretty slowly.

ARGUS pre-training employs autoregressive learning, where we learn from all events in the user’s history simultaneously, in a single transformer pass. This is a powerful acceleration that allowed us to train much larger models using the same resources.

During fine-tuning, we also used to run the transformer over and over for a single user. This is the impression-level learning that Meta used before HSTU. If a user is shown an item at a particular moment, we generate a sample of the form (user, item). The dataset can contain many such impressions for a single user, and we would rerun the transformer for each of them. For pairwise ranking, we considered triples of the form (user, item1, item2). That’s what we used before.

Seeing the acceleration achieved in the pre-training stage, we decided to apply a similar approach to fine-tuning. We developed a fine-tuning procedure for the two-tower model that teaches it ranking while running the transformer only once.

Diagram of how transformers use historical impressions and user states to form predictions. Image by Author

Let’s say we have the user’s entire history for a year, and all the recommendations shown to the user within the same period. By running a transformer with a causal mask over the entire history, we get vector representations of the user for every moment in that year at once, and so we can:

  • Separately calculate the vectors of the shown items.
  • Review the timestamps and map recommendation impressions to the user vectors corresponding to the required lag in user history delivery.
  • Calculate all the required dot products and all the terms of the loss function.

And all of this at once, for the entire year, in a single transformer pass.

Previously, we’d rerun the transformer for every impression pair; now we process all the impressions at once in a single pass. This is a massive acceleration: by a factor of tens, hundreds, or even thousands. To use such a two-tower model, we simply take the user’s vector representation at the last moment in time (corresponding to the last event in the history) as the current user vector. For the items, we use the encoder that was applied to the impressions during training. In training, we simulate a one-day user history lag and then run the model as an offline model, recalculating user vectors daily.
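A sketch of the impression-to-state mapping (PyTorch; the one-day lag and the tensor layout are illustrative):

```python
import torch

def impression_user_states(event_ts, hidden, impression_ts, lag_sec=86_400):
    """Pick, for each impression, the newest user state at least `lag_sec` old.

    event_ts: (T,) sorted event timestamps; hidden: (T, d) causal transformer
    states from one pass over the history; impression_ts: (M,) show times.
    """
    idx = torch.searchsorted(event_ts, impression_ts - lag_sec, right=True) - 1
    idx = idx.clamp(min=0)  # real code would drop impressions with no eligible history
    return hidden[idx]      # (M, d): one user vector per shown item
```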

When I say that we process the user’s entire year of history in a single transformer pass, I’m being somewhat misleading. In reality, we enforce a limit on the maximum history length, and a user in a dataset can have multiple samples, or chunks. For pre-training, these chunks don’t overlap.

During fine-tuning, however, there are limits not only on the maximum history length but also on its minimum length, as well as on the maximum number of recommendation impressions in a single training example used to train the model for ranking.

Results

We chose our music streaming service as the first one to experiment with. Recommendations are crucial there, and the service has many active users. We built a huge training dataset with over 300 billion listens from millions of users. That’s tens or even hundreds of times larger than the training datasets we’d used before.

What’s a triple (context, item, feedback) in a music streaming service?

  • Context: whether the current interaction is a recommendation or organic. If it’s a recommendation: which surface it’s on, and if it’s My Vibe, what the settings are.
  • Item: a music track. The most important feature for item encoding is the item ID. We use unified embeddings to encode features with high cardinality; in this case, we take three 512K hashes per item. We use a fixed unified embedding matrix with 130 million parameters in our experiments (see the sketch after this list).
  • User feedback: whether the track was liked, and what portion of the track was heard.
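As mentioned in the item bullet above, high-cardinality IDs are hashed into a shared table. A sketch of such a hashed unified embedding (PyTorch; the hash functions and seeds are purely illustrative):

```python
import torch
import torch.nn as nn

class HashedUnifiedEmbedding(nn.Module):
    """One shared table; each ID maps to several hashed rows that are averaged."""
    def __init__(self, table_size: int = 512_000, dim: int = 128):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        self.table_size = table_size
        self.seeds = (17, 31, 101)  # three hypothetical hash seeds

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        rows = [self.table((ids * seed + 7) % self.table_size) for seed in self.seeds]
        return torch.stack(rows, dim=0).mean(dim=0)
```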

For offline quality assessment, we use data from the week following the training period, via a global temporal split.

To assess the quality of the pre-trained model, we examine the loss values on the pre-training tasks: next-item prediction and feedback prediction. That is, we measure how well the model learned to solve the tasks we set for it. The smaller the value, the better.

Important: we feed in the user’s history over a long period, but the loss is calculated only on events that occur within the test period.

During fine-tuning, we learn to correctly rank item pairs based on user feedback, which makes PairAccuracy, the share of pairs correctly ordered by the model, a suitable offline metric for us. In practice, we weight pairs by feedback: for example, pairs in which the user liked one track and skipped the other carry a higher weight than pairs in which the user listened to one track and skipped the other.
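The metric itself is straightforward to compute; a sketch (NumPy; the weighting shown is illustrative):

```python
import numpy as np

def pair_accuracy(score_pref, score_other, weights=None):
    """Weighted share of pairs where the preferred item gets the higher score."""
    correct = (np.asarray(score_pref) > np.asarray(score_other)).astype(float)
    w = np.ones_like(correct) if weights is None else np.asarray(weights, dtype=float)
    return float((correct * w).sum() / w.sum())

# e.g., a liked-vs-skipped pair might carry more weight than listened-vs-skipped
print(pair_accuracy([2.0, 0.3], [1.0, 0.9], weights=[2.0, 1.0]))
```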

Our deployment scenario involves adding a powerful new feature to the final ranker. For this reason, we measure the relative increase in PairAccuracy for the final ranker with the new feature added, compared to the final ranker without it. The final ranker on our music streaming platform is gradient boosting.

A/B Test Results and Measurements

Our initial goal was to scale recommendation transformers. To test the scaling, we selected four transformer configurations of different sizes, ranging from 3.2 million to 1.007 billion parameters.

HSTU performance test. Image by Author

We also decided to test the performance of the HSTU architecture. In “Actions Speak Louder than Words”, the authors proposed a new encoder architecture that is quite different from the transformer. According to the authors’ experiments, this architecture outperforms transformers in recommendation tasks.

Performance test dashboard. Image by Author

There is scaling! Each new jump in architecture size leads to a quality gain, both in pre-training and in fine-tuning.

HSTU proved to be no better than transformers. We used the largest configuration mentioned by the authors of “Actions Speak Louder than Words”. It has one and a half times more parameters than our medium transformer, with roughly the same quality.

Graph describing the relationship between model size, prediction entropy, and ranking uplift. Image by Author

Let’s visualize the metrics from the table as a graph. We can then observe the scaling law across our four points: the dependence of quality on the logarithm of the number of parameters appears linear.

We performed a small ablation study to find out whether we could simplify our model or remove any parts of the training.

Results with pre-training vs. without. Image by Author

If you remove pre-training, the model’s quality drops.

Fine-tuning and pairwise accuracy results. Image by Author

If you reduce the duration of fine-tuning, the drop becomes even more pronounced.

Noticeable scaling in history length. Image by Author

Earlier in this article, I mentioned that the authors of “Actions Speak Louder than Words” trained a model with a history length of 8,000 items. We decided to give it a try: it turns out that digesting such a deep musical history yields a noticeable improvement in recommendations. Previously, our models used a maximum of 1,500 to 2,000 events. This was the first time we were able to cross that threshold.

Deployment Results

We’ve been developing transformers for music recommendations for about three years now, and we’ve come a long way. Here’s everything we’ve learned and how we’ve progressed in developing transformer-based models for music recommendations over this time.

Project Roadmap. Image by Author
  • Our first three transformers were all offline: user and item vectors were recalculated daily. User vectors were loaded into a key-value store, item vectors were stored in the service’s RAM, and only the dot product was calculated at runtime. We used some of these models not just for ranking but also for candidate generation (we know how to build multi-head models that perform both tasks). In such cases, the HNSW index for candidate retrieval also resides in the service’s RAM.
  • The first model only had a signal about likes, the second had a signal about listens (including skips), and the third combined both signal types (explicit and implicit).
  • The v4 model is an adaptation of v3 that runs at runtime with a slight lag in user history; its encoder is 6x smaller than that of the v3 model.
  • The new ARGUS model has eight times the user history length and ten times the encoder size. It also uses the new learning process I described earlier.

Deployment version dashboard. Image by Author

TLT is total listening time. The like likelihood is the chance of a user liking a recommendation when it’s shown to them. Each deployment boosted the metrics of our personalized recommendations. And the first ARGUS deployment gave about the same increase in metrics as all the previous deployments combined!

ARGUS Test Results Dashboard. Image by Author

My Vibe also has a special setting, Unfamiliar, for which we use a separate ranking stack. We made a separate ARGUS deployment for this setting, achieving a 12% increase in total listening time and a 10% growth in like likelihood. The Unfamiliar setting is used by people who are interested in discovering new music. The fact that we saw a significant increase in this category confirms that ARGUS is more effective at handling non-trivial scenarios.

We deployed ARGUS in music scenarios on smart devices and increased the total time users spend with an active speaker by 0.75%. Here, the final ranker isn’t a gradient boosting model but a full-scale ranking neural network. Because of this, we were able to feed not just a single scalar ARGUS feature but full user and item vectors into the final ranker. Compared to a single scalar feature, this increased the quality gain by another one and a half to two times.

ARGUS has already been deployed not only as a ranking feature but also for candidate generation: the team adapted the offline ARGUS into a runtime version. These deployments yielded significant gains in key metrics. Neural networks are the future of recommender systems, but there’s still a long journey ahead.

Thanks for reading.
