The industry’s outliers have distorted our definition of Recommender Systems. TikTok, Spotify, and Netflix employ hybrid deep learning models combining collaborative- and content-based filtering to deliver personalized recommendations you didn’t even know you’d like. If you’re considering a RecSys role, you might expect to dive into these immediately. But not all RecSys problems operate, or need to operate, at this level. Most practitioners work with relatively simple, tabular models, often gradient-boosted trees. Until attending RecSys ’25 in Prague, I assumed my experience was an outlier. Now I believe it is the norm, hidden behind the massive outliers that drive the industry’s state-of-the-art. So what sets these giants apart from most other companies? In this article, I use the framework mapped in the image above to reason about these differences and help place your own recommendation work on the spectrum.
Most recommendation systems begin with a candidate generation phase, reducing millions of possible items to a manageable set that can be re-ranked by higher-latency solutions. But candidate generation isn’t always the uphill battle it’s made out to be, nor does it necessarily require machine learning. Contexts with well-defined scopes and hard filters often don’t require complex querying logic or vector search. Consider Booking.com: when a user searches for “4-star hotels in Barcelona, September 12-15,” the geography and availability constraints have already narrowed millions of properties down to a few hundred, even if the backend systems handling that filtering are themselves complex. The real challenge for machine learning practitioners is then ranking these hotels with precision. That is vastly different from Amazon’s product search or the YouTube homepage, where hard filters are absent. In these environments, the system must rely on semantic intent or past behavior to surface relevant candidates from millions or billions of items before re-ranking even takes place.
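To make that contrast concrete, here is a minimal sketch of hard-filter candidate generation. The hotel schema, field names, and simplified availability window are invented for this example; Booking.com’s actual systems are, of course, far more involved.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Hotel:
    hotel_id: str
    city: str
    stars: int
    free_from: date    # simplified availability window
    free_until: date

def generate_candidates(hotels: list[Hotel], city: str, stars: int,
                        check_in: date, check_out: date) -> list[Hotel]:
    """Hard filters only: no ML, no vector search, just constraints.

    Geography, star rating, and availability typically cut millions of
    properties down to a few hundred before any ranking model runs.
    """
    return [
        h for h in hotels
        if h.city == city
        and h.stars == stars
        and h.free_from <= check_in
        and check_out <= h.free_until
    ]

# e.g. generate_candidates(all_hotels, "Barcelona", 4,
#                          date(2025, 9, 12), date(2025, 9, 15))
```

The point is that everything upstream of the ranking model can be expressed as plain predicates; the ML problem only begins once this set is in hand.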
Beyond candidate generation, the complexity of re-ranking is best understood through the two dimensions mapped in the image below. First, the observability of outcomes (together with catalog stability), which determines how strong a baseline you can have. Second, the subjectivity of user preferences and their learnability, which determines how complex your personalization solution needs to be.
Observable Outcomes and Catalog Stability
On the left end of the x-axis are businesses that directly observe their most important outcomes. Large merchants like IKEA are a great example: when a customer buys an ESKILSTUNA sofa instead of a KIVIK, the signal is unambiguous. Aggregate enough of these signals, and the company knows exactly which product has the higher purchase rate. When you can directly observe users voting with their wallets, you have a strong baseline that is hard to beat.
At the other extreme are platforms that can’t observe whether their recommendations actually succeeded. Tinder and Bumble might see users match, but they often won’t know whether the pair hit it off (especially as users move off to other platforms). Yelp and Google Maps can recommend restaurants, but for the vast majority, they can’t observe whether you actually visited, just which listings you clicked. Relying on such upper-funnel signals means position bias dominates: items in top positions accumulate interactions regardless of true quality, making it nearly impossible to tell whether engagement reflects real preference or mere visibility. Contrast this with the IKEA example: a user might click a restaurant on Yelp simply because it appeared first, but they are far less likely to buy a sofa for that same reason. In the absence of a hard conversion, you lose the anchor of a reliable leaderboard. This forces you to work much harder to extract signal from the noise. Reviews can offer some grounding, but they’re rarely dense enough to work as a primary signal. Instead, you are left to run endless experiments on your ranking heuristics, constantly tuning logic to squeeze a proxy for quality out of a stream of weak signals.
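One common mitigation, assuming you log the position at which every impression was shown, is inverse propensity scoring: weight each click by the inverse of its position’s click propensity, so that clicks earned in low-visibility slots count for more. The snippet below is an illustrative sketch; the propensity values, column names, and toy log are made up.

```python
import pandas as pd

# Hypothetical impression log: one row per time an item was shown.
logs = pd.DataFrame({
    "item_id":  ["a", "a", "b", "b", "c"],
    "position": [1, 1, 2, 5, 1],
    "clicked":  [1, 0, 1, 0, 1],
})

# Click propensity per position: how likely any item is to be clicked
# merely for sitting in that slot. Values here are invented; in practice
# they are estimated via result randomization or a position-bias model.
propensity = {1: 0.9, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

# Inverse propensity weighting: a click at position 5 counts for more
# than one at position 1, counteracting the head start of top slots.
logs["ips_click"] = logs["clicked"] / logs["position"].map(propensity)

leaderboard = (
    logs.groupby("item_id")
        .agg(ips_clicks=("ips_click", "sum"), impressions=("clicked", "size"))
)
leaderboard["debiased_ctr"] = leaderboard["ips_clicks"] / leaderboard["impressions"]
print(leaderboard.sort_values("debiased_ctr", ascending=False))
```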
High-Churn Catalog
Even with observable outcomes, however, a strong baseline is not guaranteed. If your catalog is constantly changing, you may not accumulate enough data to build a proper leaderboard. Real estate platforms like Zillow and secondhand sites like Vinted face the most extreme version: each item has an inventory of one, disappearing the moment it’s purchased. This forces you to rely on simplistic and rigid sorts like “newest first” or “lowest price per square meter.” These are far weaker than conversion leaderboards based on real, dense user signal. To do better, you need to leverage machine learning to predict conversion probability directly, combining intrinsic attributes with debiased short-term performance to surface the best inventory before it disappears.
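One hedged sketch of how that combination can work: score a brand-new listing purely from its model-predicted conversion probability, then shift weight toward its observed (debiased) performance as impressions accumulate. The shrinkage formula and the `prior_strength` knob below are illustrative assumptions, not a prescribed recipe.

```python
def listing_score(
    p_model: float,       # conversion probability predicted from intrinsic attributes
    conversions: float,   # debiased conversions observed so far (e.g. IPS-weighted)
    impressions: int,     # impressions observed so far
    prior_strength: float = 50.0,  # pseudo-impressions; a tuning knob
) -> float:
    """Bayesian-style shrinkage between a model prior and an observed rate.

    With zero impressions the score equals the model's prediction; as
    evidence accumulates, the observed conversion rate takes over.
    """
    observed_rate = conversions / impressions if impressions else 0.0
    w = impressions / (impressions + prior_strength)
    return (1 - w) * p_model + w * observed_rate

# A day-old Vinted-style listing: the model likes it, but data is thin.
print(listing_score(p_model=0.08, conversions=1, impressions=20))    # mostly prior
print(listing_score(p_model=0.08, conversions=40, impressions=800))  # mostly data
```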
The Ubiquity of Feature-Based Models
Regardless of your catalog’s stability or signal strength, the core challenge remains the same: you are trying to improve upon whatever baseline is available. This is typically achieved by training a machine learning (ML) model to predict the probability of engagement or conversion given a particular context. Gradient-boosted trees (GBDTs) are the pragmatic choice, much faster to train and tune than deep learning.
GBDTs predict these outcomes based on engineered item features: categorical and numerical attributes that quantify and describe a product. Even before individual preferences are known, GBDTs can adapt recommendations leveraging basic user features like country and device type. With these item and user features alone, an ML model can already improve upon the baseline, whether that means debiasing a popularity leaderboard or ranking a high-churn feed. For example, in fashion e-commerce, models commonly use location and time of year to surface items tied to the season, while simultaneously using country and device to calibrate the price point.
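As a minimal sketch of that setup, the snippet below trains a conversion model with LightGBM (any GBDT library would do). The features, toy data, and hyperparameters are invented for illustration.

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Invented training data: one row per impression, with the outcome label.
df = pd.DataFrame({
    # Item features
    "category":  ["coat", "swimwear", "coat", "sneakers", "swimwear", "coat"],
    "price_eur": [120.0, 35.0, 150.0, 80.0, 29.0, 110.0],
    # User/context features
    "country":   ["SE", "ES", "SE", "DE", "ES", "SE"],
    "device":    ["mobile", "desktop", "mobile", "mobile", "desktop", "mobile"],
    "month":     [11, 7, 12, 3, 8, 1],
    # Label: did the impression convert?
    "converted": [1, 1, 0, 0, 1, 1],
})
# LightGBM consumes pandas categoricals natively; no one-hot encoding needed.
for col in ["category", "country", "device"]:
    df[col] = df[col].astype("category")

X, y = df.drop(columns="converted"), df["converted"]
model = LGBMClassifier(n_estimators=50, learning_rate=0.1, min_child_samples=1)
model.fit(X, y)

# Score candidates for, say, a Swedish mobile user in November: the model
# can already lean on seasonality and price point before any personalization.
print(model.predict_proba(X)[:, 1])
```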
These features allow the model to combat the aforementioned position bias by separating true quality from mere visibility. By learning which intrinsic attributes drive conversion, the model can correct for the position bias inherent in your popularity baseline. It learns to identify items that perform on merit, rather than simply because they were ranked at the top. This is harder than it looks: you risk demoting proven winners more than you should, potentially degrading the experience.
Contrary to popular belief, feature-based models can also drive personalization, depending on how much semantic information items naturally contain. Platforms like Booking.com and Yelp accumulate rich descriptions, multiple photos, and user reviews that provide semantic depth per listing. These can be encoded into semantic embeddings for personalization: by using the user’s recent interactions, we can calculate similarity scores against candidate items and feed these to the gradient-boosted model as features.
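Here is a sketch of such a feature, assuming item embeddings (say, from a sentence encoder run over listing descriptions) are already available; the shapes, names, and mean-pooling choice are illustrative.

```python
import numpy as np

def similarity_feature(
    recent_item_embs: np.ndarray,  # (n_recent, dim): the user's recent interactions
    candidate_embs: np.ndarray,    # (n_candidates, dim)
) -> np.ndarray:
    """Cosine similarity between a user taste vector and each candidate.

    The taste vector is simply the mean of recently interacted items;
    the resulting scores become one more column in the GBDT's feature matrix.
    """
    taste = recent_item_embs.mean(axis=0)
    taste /= np.linalg.norm(taste) + 1e-9
    cands = candidate_embs / (
        np.linalg.norm(candidate_embs, axis=1, keepdims=True) + 1e-9
    )
    return cands @ taste  # shape (n_candidates,)

rng = np.random.default_rng(0)
recent = rng.normal(size=(5, 128))       # 5 recent hotels, 128-dim embeddings
candidates = rng.normal(size=(300, 128))
sim = similarity_feature(recent, candidates)  # feed as a GBDT feature column
```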
This approach has its limits, however. Feature-based models can recommend based on similarity to recent interactions, but unlike collaborative filtering, they don’t directly learn which items tend to be liked by similar users. To learn that, they need such similarity signals provided as input features. Whether this limitation matters depends on something more fundamental: how much users actually disagree.
Subjectivity
Not all domains are equally personal or controversial. In some, users largely agree on what makes a good product once basic constraints are satisfied. We call these convergent preferences, and they occupy the bottom half of the chart. Take Booking.com: travelers may have different budgets and location preferences, but once those are revealed through filters and map interactions, ranking criteria converge; higher prices are bad, amenities are good, good reviews are better. Or consider Staples: once a user needs printer paper or AA batteries, brand and price dominate, making user preferences remarkably consistent.
At the other extreme, the top half of the chart, are subjective domains defined by highly fragmented taste. Spotify exemplifies this: one user’s favorite track is another’s immediate skip. Yet taste rarely exists in a vacuum. Somewhere in the data is a user on your exact wavelength, and machine learning bridges the gap, turning their discoveries from yesterday into your recommendations for today. Here, the value of personalization is huge, and so is the technical investment required.
The Right Data
Subjective taste is only actionable if you have enough data to observe it. Many domains involve distinct preferences but lack the feedback loop to capture them. A niche content platform, new marketplace, or B2B product may face wildly divergent tastes yet lack the clear signal to learn them. Yelp restaurant recommendations illustrate this challenge: dining preferences are subjective, but the platform can’t observe actual restaurant visits, only clicks. This means they can’t optimize personalization for the true goal (conversions). They can only optimize for proxy metrics like clicks, but more clicks might actually signal failure, indicating users are browsing multiple listings without finding what they want.
But in subjective domains with dense behavioral data, failing to personalize leaves money on the table. YouTube exemplifies this: with billions of daily interactions, the platform learns nuanced viewer preferences and surfaces videos you didn’t know you wanted. Here, deep learning becomes unavoidable. This is the point where you’ll see large teams coordinating over Jira and cloud bills that require VP approval. Whether that complexity is justified comes down entirely to the data you have.
Know Where You Stand
Understanding where your problem sits on this spectrum is far more valuable than blindly chasing the latest architecture. The industry’s “state-of-the-art” is largely defined by the outliers: the tech giants dealing with massive, subjective inventories and dense user data. Their solutions are famous because their problems are extreme, not because they’re universally correct.
However, you’ll likely face different constraints in your own work. If your domain is defined by a stable catalog and observable outcomes, you land in the bottom-left quadrant alongside companies like IKEA and Booking.com. Here, popularity baselines are so strong that the challenge is simply building upon them with machine learning models that can drive measurable A/B test wins. If, instead, you face high churn (like Vinted) or weak signals (like Yelp), machine learning becomes a necessity just to keep up.
But that doesn’t mean you’ll need deep learning. That added complexity only truly pays off in territories where preferences are deeply subjective and there’s enough data to model them. We often treat systems like Netflix or Spotify as the gold standard, but they’re specialized solutions to rare conditions. For the rest of us, excellence isn’t about deploying the most complex architecture available; it’s about recognizing the constraints of the terrain and having the confidence to choose the solution that solves your problems.
