Hashing in Modern Recommender Systems: A Primer



Hashing is one of the most common "tricks" used in industrial Machine Learning applications, yet it doesn't get nearly as much attention as it deserves.

The biggest advantage of hashing, especially in modern recommender systems, is its finite-memory guarantee: without hashing, it would be highly impractical to learn the relevance of billions of videos, news articles, photos, or web pages for billions of users without running out of memory.

But we’re getting ahead of ourselves here. This post is a primer, so let’s return to where it began: the famous 2009 “hashing trick” paper.

The paper that started it all

The idea of using hashing as a way to process features, as well as the term "hashing trick", was first introduced in a 2009 paper by a team of researchers from Yahoo, led by Kilian Weinberger, in the context of email spam detection. An email, after all, is a sequence of words, and each word can be regarded as a feature. With hashing, the authors explain, we can represent emails as vectors (where each index corresponds to a particular word), and use these vectors in a spam collaborative filtering model.

For instance, for the phrase "Your prescription is ready" in a 20-dimensional hash space, each word would be mapped to some index between 0 and 19, where the exact positions depend on the particular choice of hash function used.
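The hashing trick can be sketched in a few lines of Python. The 20-dimensional space matches the example above, but the md5-based hash function is an illustrative choice, not the paper's exact setup:

```python
# A minimal sketch of the hashing trick: hash each word of an email into
# a fixed-size count vector (md5 and dim=20 are illustrative choices).
import hashlib

def hash_features(text: str, dim: int = 20) -> list[int]:
    """Map a string of words to a dim-dimensional count vector."""
    vec = [0] * dim
    for word in text.lower().split():
        # md5 is stable across processes, unlike Python's built-in hash()
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1
    return vec

vec = hash_features("your prescription is ready")  # 4 words, 20 buckets
```

Note that the vector length is fixed at `dim` no matter how many distinct words the corpus contains, which is exactly the finite-memory guarantee mentioned earlier.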

However, different words can have different relevance to different people. The word "prescription" may signal spam to one user, but not to another. To account for this difference in relevance, the authors introduce "personalized" tokens, in which they combine the user id with the words themselves. For the phrase above, they would not only hash the tokens

"your", "prescription",  "is", "ready",

but in addition

"user123_your", "user123_prescription", "user123_is", "user123_ready",

and so forth. Given their dataset of 40 million unique words and 400,000 users, this personalized tokenization results in a total of 16 trillion possible features.
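In code, the personalized tokenization might look like this (the `user123_` prefix format mirrors the tokens above; the helper name is my own):

```python
def personalized_tokens(user_id: str, text: str) -> list[str]:
    """Original words plus user-specific copies, as in the 2009 paper."""
    words = text.lower().split()
    return words + [f"{user_id}_{w}" for w in words]

tokens = personalized_tokens("user123", "your prescription is ready")
# ['your', 'prescription', 'is', 'ready',
#  'user123_your', 'user123_prescription', 'user123_is', 'user123_ready']
```

All of these tokens are then fed through the same hash function, so the personalized features cost no extra memory beyond the fixed hash space.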

By hashing these personalized features into a hash space of size 2²², or roughly 4.2 million (a reduction of almost seven orders of magnitude!), the authors achieve a total spam reduction of 30% compared to a baseline model without hashing-powered personalization, one of the first clear demonstrations of the usefulness of hashing in ML problems.

Hashing in modern recommender systems

Fast forward from 2009 to today, and we see that a lot of the ML technology has changed (deep neural nets have largely replaced linear collaborative filters), but hashing is still there.

Modern recommender systems are usually some variant of a two-tower neural network, where one tower learns user embeddings from user ids, and the other tower learns, say, video embeddings from video ids (for a video recommender system). At training time, the model is fed user/video pairs from historic engagement data, such as video impressions with clicks as positives and those without clicks as negatives, resulting in a shared embedding space for users and videos. We can then store the embeddings for all of our users and videos in a set of embedding tables, and use these tables to make recommendations at serving time, for example using a kNN algorithm.
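As a toy illustration of the serving-time step, here is a pure-Python sketch that looks up hashed user and video ids in small embedding tables and scores them with a dot product. The table sizes, dimensions, and random embeddings are placeholders; in a real system the tables are learned by the two towers and retrieval uses approximate kNN:

```python
import random
import zlib

NUM_BUCKETS, DIM = 1_000, 16  # tiny stand-ins for tables with millions of rows
rng = random.Random(0)
# Placeholder tables; in production these are learned by the two towers.
user_table = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]
video_table = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]

def bucket(raw_id: str) -> int:
    # crc32 is a stable, fast stand-in for a production hash function
    return zlib.crc32(raw_id.encode()) % NUM_BUCKETS

def score(user_id: str, video_id: str) -> float:
    u = user_table[bucket(user_id)]
    v = video_table[bucket(video_id)]
    return sum(a * b for a, b in zip(u, v))  # dot product in the shared space
```

The key detail is `% NUM_BUCKETS`: every id, no matter how many exist, lands in a table of fixed size.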

So far so good, but where does hashing fit in here?

Well, consider a model that stores modest 100-dimensional float embeddings for 1B videos and 1B users: at 4 bytes per value, that's already 800GB of memory. Such a huge memory footprint would make the model impractical (if not impossible) and expensive to serve. With hashing, we can first reduce the "vocabulary" of videos and users down to, say, 10M each, resulting in a far more manageable memory footprint of 8GB.
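The back-of-the-envelope math from the paragraph above, assuming float32 (4-byte) embedding entries:

```python
# Memory footprint of an embedding table, in GB (decimal).
def table_gb(num_ids: int, dim: int = 100, bytes_per_val: int = 4) -> float:
    return num_ids * dim * bytes_per_val / 1e9

full = table_gb(1_000_000_000) * 2   # 1B users + 1B videos -> 800.0 GB
hashed = table_gb(10_000_000) * 2    # two 10M-bucket hashed tables -> 8.0 GB
```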

Hashing, in other words, allows us to scale modern recommender systems to billions of users and billions of items. Without hashing, the scale of recommender systems would be fundamentally limited by the amount of memory we can afford.

Towards collision-free hashing

The downside of hashing is the existence of hash collisions: multiple different ids will eventually be mapped to the same hash, resulting in multiple users or items sharing the same embedding in the embedding table. Obviously, this "cramping" of information can degrade the resulting recommendations. One of the most important research questions in recommender systems today is therefore how to make them collision-free.

One idea is the use of "Deep Hash Embeddings" (DHE), proposed in 2021 by a team from Google Brain led by Wang-Cheng Kang. Instead of a single hash function, they use a large number of hash functions, and concatenate all hashes for a single id into a dense vector, which is then fed into a deep neural net that learns higher-order representations of the ids. The key idea is that, if the number of hash functions is large enough, hash collisions become statistically impossible.
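Sketched in Python, the DHE encoding step might look like the following. I simulate the k independent hash functions by salting a single crc32 hash with the function index, and the scaling of each hash to [-1, 1] is an illustrative choice; the deep network that consumes the vector is omitted:

```python
import zlib

def dhe_encode(raw_id: str, k: int = 1024, m: int = 1_000_000) -> list[float]:
    """Encode an id as a k-dimensional dense vector via k hash functions.

    The resulting vector would be fed into a deep neural net that learns
    the final embedding (not shown here).
    """
    vec = []
    for i in range(k):
        h = zlib.crc32(f"{i}:{raw_id}".encode()) % m  # i-th "hash function"
        vec.append(h / (m - 1) * 2 - 1)               # scale to [-1, 1]
    return vec
```

Two ids would now need to collide under all 1024 hash functions at once to share a representation, which is why collisions become vanishingly unlikely.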

And indeed, their approach shows promise: using DHE with 1024 hash functions, the authors see an improvement of 0.25% in AUC over regular hashing on a set of public benchmark datasets. One caveat, though, is that DHE doesn't work for multivalent features, only for ids directly. (For example, DHE, as proposed here, couldn't encode a list of the movies you've watched on Netflix in the past 30 days, which would be a multivalent feature.)

Another promising collision-free hashing approach is "Cuckoo hashing", first introduced in 2001 by Rasmus Pagh and Flemming Rodler at Aarhus University, Denmark. Cuckoo hashing is used in Monolith, ByteDance's online recommendation system, introduced in their 2022 paper, led by Zhuoran Liu.

In Cuckoo hashing we maintain not one but multiple hash tables: in ByteDance's paper, they use two. When we encounter a new id, by default we hash it into the first hash table. If that spot is already occupied by another id, we evict the occupant (hence the name of the algorithm) and re-hash it into a spot in the second hash table. This process repeats until there are no more evictions and the set of hash tables stabilizes without collisions. Compared to regular hashing (with collisions), the authors find an improvement of 0.5% AUC on a public benchmark dataset with their collision-free approach.
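The eviction loop can be sketched as follows, with two tables as in ByteDance's setup. The table size, salted-crc32 hash functions, and eviction cap are all illustrative choices, not Monolith's actual implementation:

```python
import zlib

class Cuckoo:
    """Minimal two-table cuckoo hash storing keys (no values, for brevity)."""

    def __init__(self, size: int = 8):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slot(self, key: str, t: int) -> int:
        # Salt the hash with the table index to get two distinct functions.
        return zlib.crc32(f"{t}:{key}".encode()) % self.size

    def insert(self, key: str, max_evictions: int = 32) -> bool:
        t = 0
        for _ in range(max_evictions):
            slot = self._slot(key, t)
            if self.tables[t][slot] in (None, key):
                self.tables[t][slot] = key
                return True
            # Evict the occupant and try to place it in the other table.
            key, self.tables[t][slot] = self.tables[t][slot], key
            t = 1 - t
        return False  # a real implementation would rehash or grow here

    def lookup(self, key: str) -> bool:
        return any(self.tables[t][self._slot(key, t)] == key for t in (0, 1))
```

Every stored key occupies its own slot in one of the two tables, so lookups never return another id's embedding, which is exactly the collision-free property the paper is after.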

Coda: the most underrated trick in applied Machine Learning

Let's recap:

  • Hashing allows us to scale modern recommender systems to billions of users and billions of items with a finite-memory guarantee.
  • The idea of using hashing in ML applications dates back to a 2009 paper from Yahoo, which demonstrated its usefulness in an email spam detection model.
  • Hashing, however, introduces hash collisions, which can degrade recommendations because multiple users and items end up sharing the same embedding.
  • For this reason, recent research in recommender systems focuses on how to make hashing collision-free. Notable examples are Deep Hash Embeddings (Google Brain) and Cuckoo hashing (ByteDance).

For many years, hashing has been among the most underrated tricks in the applied ML literature, with much of the attention focusing instead on model architecture, data selection, or feature engineering.

This may be starting to change. As the recent papers from Google Brain, ByteDance, and others have shown, optimizing recommender systems for hash collisions can enable notable gains in performance. The surprising rise in popularity of TikTok (owned by ByteDance) may, at least in part, be explained by better hashing.

Watch this space: new breakthroughs are certainly on the horizon.

