User action sequences are among the most powerful inputs in recommender systems: your next click, read, watch, play, or purchase is likely at least somewhat related to what you've clicked on, read, watched, played, or purchased minutes, hours, days, months, or even years ago.
Historically, the status quo for modeling such user engagement sequences has been pooling: for example, a classic 2016 YouTube paper describes a system that takes the 50 most recently watched videos, looks up their embeddings in an embedding table, and pools them into a single feature vector with sum pooling. To save memory, the embedding table for these sequence videos is shared with the embedding table for the candidate videos themselves.
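To make the idea concrete, here is a minimal PyTorch sketch of that setup: a single shared embedding table serving both the watch history and the candidate video, with the history sum-pooled into one fixed-size vector. All names and dimensions (`VOCAB_SIZE`, `EMBED_DIM`, `HISTORY_LEN`, the helper functions) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1_000_000   # assumed corpus size, for illustration only
EMBED_DIM = 64           # assumed embedding dimension
HISTORY_LEN = 50         # the 50 most recent watches, as in the paper

# One table serves both history items and candidate items, saving memory.
shared_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=0)

def pooled_history_feature(watched_ids: torch.Tensor) -> torch.Tensor:
    """Sum-pool the embeddings of the last HISTORY_LEN watched videos.

    watched_ids: (batch, HISTORY_LEN) int tensor of video ids,
                 padded with 0 for users with shorter histories.
    returns:     (batch, EMBED_DIM) feature vector.
    """
    embs = shared_embedding(watched_ids)   # (batch, HISTORY_LEN, EMBED_DIM)
    return embs.sum(dim=1)                 # order and recency are lost here

def candidate_feature(candidate_ids: torch.Tensor) -> torch.Tensor:
    # The candidate video is looked up in the same shared table.
    return shared_embedding(candidate_ids)  # (batch, EMBED_DIM)
```

Note the `embs.sum(dim=1)` line: whatever the order of the 50 watches, the pooled vector is identical, which is exactly the limitation discussed next.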
This simplistic approach corresponds roughly to a bag-of-words approach in the NLP domain: it works, but it's far from ideal. Pooling takes into account neither the sequential nature of the inputs, nor the relevance of each item in the user history with respect to the candidate item we need to rank, nor any of the temporal information: an…