
Deep Learning in Recommender Systems: A Primer

A tour of crucial technological breakthroughs behind modern industrial recommender systems

Image credit: Pixabay

Recommender systems are among the fastest-evolving industrial Machine Learning applications today. From a business point of view, this is no surprise: better recommendations bring more users. It’s as simple as that.

The underlying technology, however, is far from simple. Ever since the rise of deep learning — powered by the commoditization of GPUs — recommender systems have become more and more complex.

In this post, we’ll take a tour of a handful of the most important modeling breakthroughs from the past decade, roughly reconstructing the pivotal points marking the rise of deep learning in recommender systems. It’s a story of technological breakthroughs, scientific exploration, and an arms race spanning continents and corporations.

Buckle up. Our tour starts in 2017’s Singapore.

Image credit: He et al (2017)

Any discussion of deep learning in recommender systems would be incomplete without a mention of one of the most important breakthroughs in the field, Neural Collaborative Filtering (NCF), introduced in He et al (2017) from the National University of Singapore.

Prior to NCF, the gold standard in recommender systems was matrix factorization, in which we learn latent vectors (aka embeddings) for both users and items, and then generate recommendations for a user by taking the dot product between the user vector and the item vectors. The closer the dot product is to 1, as we know from linear algebra, the better the predicted match. As such, matrix factorization can simply be viewed as a linear model of latent factors.
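To make the dot-product scoring concrete, here is a minimal sketch in PyTorch (the class and variable names are hypothetical, not from any paper): we look up the user and item embeddings and take their dot product.

import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    # Minimal matrix factorization: score(user, item) = dot(user embedding, item embedding).
    def __init__(self, num_users: int, num_items: int, dim: int = 32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        u = self.user_emb(user_ids)    # (batch, dim)
        v = self.item_emb(item_ids)    # (batch, dim)
        return (u * v).sum(dim=-1)     # one dot product per user/item pair

# Usage: score two (user, item) pairs.
model = MatrixFactorization(num_users=1000, num_items=500)
scores = model(torch.tensor([1, 2]), torch.tensor([10, 42]))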

The key idea in NCF is to replace the inner product in matrix factorization with a neural network. In practice, this is done by first concatenating the user and item embeddings, and then passing them into a multi-layer perceptron (MLP) with a single task head that predicts user engagement such as clicks. Both the MLP weights and the embedding weights (which map ids to their respective embeddings) are then learned during model training via backpropagation of the loss gradients.
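As a rough sketch of this idea (simplified: the paper’s full model also includes a GMF branch, which we omit here, and all layer sizes below are made-up examples), we concatenate the two embeddings and feed them through an MLP ending in a sigmoid-activated head.

import torch
import torch.nn as nn

class NCF(nn.Module):
    # Simplified Neural Collaborative Filtering: MLP over concatenated user/item embeddings.
    def __init__(self, num_users: int, num_items: int, dim: int = 32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),              # single task head
        )

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # predicted engagement probability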

The hypothesis behind NCF is that user/item interactions aren’t linear, as assumed in matrix factorization, but instead non-linear. If that’s true, we should see better performance as we add more layers to the MLP. And that’s precisely what He et al find. With 4 layers, they’re able to beat the best matrix factorization algorithms of the time by around 5% in hit rate on the MovieLens and Pinterest benchmark datasets.

He et al demonstrated the immense value of deep learning in recommender systems, marking the pivotal transition away from matrix factorization and towards deep recommenders.

Image credit: Cheng et al (2016)

Our tour continues from Singapore to Mountain View, California.

While NCF revolutionized the domain of recommender systems, it lacks an important ingredient that turned out to be critical for the success of recommenders: cross features. The idea of cross features was popularized in Google’s 2016 paper “Wide & Deep Learning for Recommender Systems”.

What’s a cross feature? It’s a second-order feature that’s created by “crossing” two of the original features. For example, in the Google Play Store, first-order features include the impressed app, or the list of user-installed apps. These two can be combined to create powerful cross features, such as

AND(user_installed_app='netflix', impression_app='hulu')

which is 1 if the user has Netflix installed and the impressed app is Hulu.

Cross features can also be more coarse-grained, such as

AND(user_installed_category='video', impression_category='music')

and so on. The authors argue that adding cross features of different granularities enables both memorization (from more granular crosses) and generalization (from less granular crosses).
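In code, such a cross feature is nothing more than a binary indicator computed from two first-order features. A minimal sketch with hypothetical field and function names:

def installed_x_impression(user_installed_apps: set, impression_app: str) -> int:
    # AND(user_installed_app='netflix', impression_app='hulu')
    return int("netflix" in user_installed_apps and impression_app == "hulu")

# 1 if the user has Netflix installed and the impressed app is Hulu, else 0.
feature = installed_x_impression({"netflix", "spotify"}, "hulu")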

The key architectural choice in Wide&Deep is to have both a wide module, which is a linear layer that takes all cross features directly as inputs, and a deep module, which is essentially an NCF, and then combine both modules into a single output task head that learns from user/app engagements.
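Schematically, that looks something like the sketch below (a simplification, not Google’s production model; the shared id vocabulary, the mean-pooling of id embeddings, and the layer sizes are all assumptions): the wide part is a linear layer over the hand-engineered cross features, the deep part is an embedding-plus-MLP tower, and the two logits are summed before the sigmoid.

import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    # Wide (linear over cross features) + Deep (MLP over embeddings), summed into one logit.
    def __init__(self, num_cross_features: int, vocab_size: int, dim: int = 16):
        super().__init__()
        self.wide = nn.Linear(num_cross_features, 1)                                 # memorization
        self.emb = nn.Embedding(vocab_size, dim)                                     # id embeddings
        self.deep = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))   # generalization

    def forward(self, cross_feats, id_feats):
        # cross_feats: (batch, num_cross_features) binary cross-feature indicators
        # id_feats:    (batch, num_id_features)    integer ids
        deep_in = self.emb(id_feats).mean(dim=1)            # pool the id embeddings
        logit = self.wide(cross_feats) + self.deep(deep_in)
        return torch.sigmoid(logit).squeeze(-1)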

And indeed, Wide&Deep works remarkably well: the authors found a lift of 1% in online app acquisitions by going from deep-only to wide and deep. Given that Google makes tens of billions in revenue every year from its Play Store, it’s easy to see how impactful Wide&Deep was.

Image credit: Wang et al (2017)

Wide&Deep proved the importance of cross features, but it has a huge downside: the cross features need to be manually engineered, which is a tedious process that requires engineering resources, infrastructure, and domain expertise. Cross features à la Wide & Deep are expensive. They don’t scale.

Enter “Deep and Cross neural networks” (DCN), introduced in a 2017 paper, also from Google. The key idea in DCN is to replace the wide component in Wide&Deep with a “cross neural network”, a neural network dedicated to learning cross features of arbitrarily high order.

What makes a cross neural network different from a standard MLP? As a reminder, in an MLP each neuron in the next layer is a linear combination of all neurons in the previous layer, passed through a non-linearity:

x_{l+1} = f(W_l x_l + b_l)

In contrast, in the cross neural network the next layer is constructed by forming second-order combinations of the first layer with itself:

x_{l+1} = x_0 x_l^T w_l + b_l + x_l

Hence, a cross neural network of depth L will learn cross features in the form of polynomials of degree up to L. The deeper the network, the higher-order the interactions that are learned.
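Here is a sketch of a single cross layer, following the formula above (the weight initialization and tensor shapes are illustrative choices, not the paper’s code):

import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    # One DCN cross layer: x_{l+1} = x_0 * (x_l . w_l) + b_l + x_l
    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) / dim ** 0.5)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # (x_l . w_l) is a scalar per example; multiplying it by x_0 creates
        # explicit interactions between the current layer and the original input.
        return x0 * (xl @ self.w).unsqueeze(-1) + self.b + xl

# Stacking cross layers yields progressively higher-order feature crosses.
x0 = torch.randn(8, 32)
x = x0
for layer in [CrossLayer(32) for _ in range(3)]:
    x = layer(x0, x)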

And indeed, the experiments confirm that DCN works. Compared to a model with just the deep component, DCN has a 0.1% lower logloss (which is considered statistically significant) on the Criteo display ads benchmark dataset. And that’s without any of the manual feature engineering required in Wide&Deep!

(It would have been nice to see a comparison between DCN and Wide&Deep. Alas, the authors of DCN didn’t have a good way to manually create cross features for the Criteo dataset, and hence skipped this comparison.)

Image credit: Guo et al (2017)

Next, our tour takes us from 2017’s Google to 2017’s Huawei.

Huawei’s solution for deep recommendation, “DeepFM”, also replaces the manual feature engineering in the wide component of Wide&Deep with a dedicated neural network that learns cross features. However, unlike DCN, the wide component is not a cross neural network, but instead a so-called FM (“factorization machine”) layer.

What does the FM layer do? It simply takes the dot products of all pairs of embeddings. For example, if a movie recommender takes 4 id-features as inputs, such as user id, movie id, actor ids, and director id, then the model learns embeddings for all of these id features, and the FM layer computes 6 dot products, corresponding to the combinations user-movie, user-actor, user-director, movie-actor, movie-director, and actor-director. It’s a comeback of the idea behind matrix factorization. The output of the FM layer is then combined with the output of the deep component into a sigmoid-activated output, resulting in the model’s prediction.
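The pairwise dot products can be computed in a few lines. Below is a sketch of just the FM interaction term (the full DeepFM also has first-order terms and the deep tower), using the classic FM identity that the sum over all pairwise dot products equals half of (sum of embeddings)^2 minus the sum of squared embeddings:

import torch

def fm_pairwise_interactions(embs: torch.Tensor) -> torch.Tensor:
    # embs: (batch, num_fields, dim), one embedding per id feature
    # (e.g. user, movie, actor, director -> num_fields = 4).
    sum_then_square = embs.sum(dim=1).pow(2)    # (batch, dim)
    square_then_sum = embs.pow(2).sum(dim=1)    # (batch, dim)
    # Sum of dot(e_i, e_j) over all pairs i < j:
    return 0.5 * (sum_then_square - square_then_sum).sum(dim=-1)

# With 4 fields, this implicitly covers all 6 pairwise dot products.
scores = fm_pairwise_interactions(torch.randn(8, 4, 16))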

And indeed, as you may have guessed, DeepFM has been shown to work. The authors show that DeepFM beats a number of its competitors (including Google’s Wide&Deep) by more than 0.37% and 0.42% in terms of AUC and Logloss, respectively, on company-internal data.

Image credit: Naumov et al (2019)

Let’s leave Google and Huawei for now. The next stop on our tour is 2019’s Meta.

Meta’s DLRM (“deep learning recommendation model”) architecture, presented in Naumov et al (2019), works as follows: all categorical features are transformed into embeddings using embedding tables. All dense features are passed into an MLP that computes embeddings for them as well. Importantly, all embeddings have the same dimension. Then, we simply compute the dot products of all pairs of embeddings, concatenate them into a single vector, and pass that vector through a final MLP with a single sigmoid-activated task head that produces the prediction.
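A sketch of that interaction step is shown below (the shapes, layer sizes, and the number of categorical features are illustrative; the real DLRM is configurable and much larger):

import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    # Dense features -> bottom MLP -> one more "embedding"; then dot products of all pairs.
    def __init__(self, num_dense: int, dim: int = 16):
        super().__init__()
        self.bottom_mlp = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
        self.top_mlp = nn.Linear(dim + 3, 1)    # 3 = number of unique embedding pairs below

    def forward(self, dense: torch.Tensor, cat_embs: torch.Tensor) -> torch.Tensor:
        # dense: (batch, num_dense); cat_embs: (batch, 2, dim), e.g. user and item embeddings
        d = self.bottom_mlp(dense)                               # dense "embedding", same dim
        all_embs = torch.cat([d.unsqueeze(1), cat_embs], dim=1)  # (batch, 3, dim)
        dots = torch.bmm(all_embs, all_embs.transpose(1, 2))     # (batch, 3, 3) pairwise dot products
        i, j = torch.triu_indices(3, 3, offset=1)
        pairs = dots[:, i, j]                                    # (batch, 3) unique pairs only
        out = self.top_mlp(torch.cat([d, pairs], dim=-1))        # concatenate and predict
        return torch.sigmoid(out).squeeze(-1)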

DLRM, then, is almost a simplified version of DeepFM: if you take DeepFM and drop the deep component (keeping just the FM component), you have something like DLRM, but without DLRM’s dense MLP.

In their experiments, Naumov et al show that DLRM beats DCN in terms of both training and validation accuracy on the Criteo display ads benchmark dataset. This result indicates that the deep component in DCN may indeed be redundant, and that all we really need in order to make the best possible recommendations are the feature interactions, which DLRM captures with dot products.

Image credit: Zhang et al (2022)

In contrast to DCN, the feature interactions in DLRM are limited to second order only: they’re just dot products of all pairs of embeddings. Going back to the movie example (with features user, movie, actors, director), the second-order interactions would be user-movie, user-actor, user-director, movie-actor, movie-director, and actor-director. A third-order interaction would be something like user-movie-director, actor-actor-user, director-actor-user, and so on. Certain users may be fans of Steven Spielberg movies starring Tom Hanks, and there should be a cross feature for that! Alas, in standard DLRM, there isn’t. That’s a major limitation.

Enter DHEN, the final landmark paper in our tour of modern recommender systems. DHEN stands for “Deep and Hierarchical Ensemble Network”, and the key idea is to create a “hierarchy” of cross features that grows deeper with the number of DHEN layers.

It’s easiest to understand DHEN with a simple example first. Suppose we have two input features going into DHEN, and let’s denote them by A and B (which could stand for user ids and video ids, for example). A 2-layer DHEN module would then create the entire hierarchy of cross features up to second order, namely:

A, AxA, AxB, B, BxB,

where “x” stands for either one or a combination of the following 5 interactions (a toy sketch follows the list):

  • dot product,
  • self-attention,
  • convolution,
  • linear: y = Wx, or
  • the cross module from DCN.
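To give a feel for the structure (this is only a toy sketch using just two of the five interaction types, with made-up shapes, and is not the paper’s implementation), one DHEN-style layer applies an ensemble of interaction modules to the current representation and mixes their outputs; stacking layers builds the hierarchy.

import torch
import torch.nn as nn

class DHENLayer(nn.Module):
    # One toy DHEN-style layer: ensemble of {linear, self-attention} interactions, then mix.
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)                                      # "linear" interaction
        self.attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)  # self-attention interaction
        self.mix = nn.Linear(2 * dim, dim)                                     # combine ensemble outputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_fields, dim), one embedding per input feature (e.g. A and B)
        linear_out = self.linear(x)
        attn_out, _ = self.attn(x, x, x)
        return self.mix(torch.cat([linear_out, attn_out], dim=-1))

# Stacking layers grows the hierarchy: layer 2 forms interactions of the interactions from layer 1.
x = torch.randn(8, 2, 32)
stack = nn.Sequential(DHENLayer(32), DHENLayer(32))
out = stack(x)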

DHEN is a beast, and its computational complexity (owing to its recursive nature) is a nightmare. In order to get it to work, the authors of the DHEN paper had to invent a new distributed training paradigm called “Hybrid Sharded Data Parallel”, which achieves 1.2X higher throughput than the (then) state of the art.

But most importantly, the beast works: in their experiments on internal click-through-rate data, the authors measure a 0.27% improvement in NE (normalized entropy) compared to DLRM, using a stack of 8 (!) DHEN layers.

Evolution of the Criteo display ads competition leaderboard. Screenshot from paperswithcode.com.

And this concludes our tour. Allow me to summarize each of these landmarks with a single headline:

  • NCF: All we need are embeddings for users and items. The MLP will take care of the rest.
  • Wide&Deep: Cross features matter. In fact, they’re so important that we feed them directly into the task head.
  • DCN: Cross features matter, but they shouldn’t be engineered by hand. Let the cross neural network take care of that.
  • DeepFM: Let’s generate cross features with an FM layer instead, and still keep the deep component from Wide&Deep.
  • DLRM: FM is all we need — and also another, dedicated MLP for dense features.
  • DHEN: FM is not enough. We need a hierarchy of higher-order (beyond second-order) feature interactions, and also a bunch of optimizations to make it work in practice.

And the journey is really just getting started. At the time of this writing, DCN has evolved into DCN-M, DeepFM has evolved into xDeepFM, and the leaderboard of the Criteo competition has been claimed by Huawei’s latest invention, FinalMLP.

Given the enormous economic incentive for better recommendations, it’s guaranteed that we’ll continue to see new breakthroughs in this domain for the foreseeable future. Watch this space.
