I’d like to share a practical variation of Uber’s Two-Tower Embedding (TTE) approach for cases where both user-related data and computing resources are limited. The problem came from a high-traffic discovery widget on the home screen of a food delivery app. This widget shows curated, themed selections. The selections are created from tags: each restaurant can have multiple tags, and each tile is essentially a tag-defined slice of the catalog (with the addition of some manual picking). In other words, the candidate set is already known, so the real problem is not retrieval but ranking.
At the time, this widget was significantly underperforming compared to other widgets on the discovery (main) screen. The final selection was ranked by general popularity without taking any personalized signals into account. What we discovered is that users are reluctant to scroll: if they don’t find something interesting within the first 10 to 12 positions, they typically don’t convert. On the other hand, the selections can be large, in some cases up to 1,500 restaurants. On top of that, a single restaurant can be chosen for several selections, which means that, for example, McDonald’s can appear in two different selections at once, but its popularity is only valid for the first of them, while general popularity sorting would put it on top in both.
The product setup makes the problem even less friendly to static solutions such as general popularity sorting. The selections are dynamic and change frequently due to seasonal campaigns, operational needs, or new business initiatives. Because of that, training a dedicated model for each individual selection is not realistic. A useful recommender has to generalize to new tag-based selections from day one.
Before moving to a two-tower-style solution, we tried simpler approaches such as localized popularity ranking at the city-district level and multi-armed bandits. In our case, neither delivered a measurable uplift over a general popularity sort. As part of our research initiative, we tried to adapt Uber’s TTE to our case.
Two-Tower Embeddings Recap
A two-tower model learns two encoders in parallel: one for the user side and one for the restaurant side. Each tower produces a vector in a shared latent space, and relevance is estimated from a similarity score, often a dot product. The operational advantage is decoupling: restaurant embeddings can be precomputed offline, while the user embedding is generated online at request time. This makes the approach attractive for systems that need fast scoring and reusable representations.
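As a minimal sketch of that decoupling (all names, sizes, and dimensions here are illustrative, not taken from the original system), precomputed restaurant vectors can be scored against an online user vector with a single dot product:

```python
import numpy as np

rng = np.random.default_rng(0)

# Precomputed offline: one vector per restaurant in a shared 32-dim space.
restaurant_embeddings = rng.normal(size=(1500, 32)).astype(np.float32)

def user_tower(user_features: np.ndarray) -> np.ndarray:
    # Stand-in for the online user tower: any mapping into the same space.
    return user_features[:32]

def rank_candidates(user_vec: np.ndarray, candidate_ids: list[int]) -> list[int]:
    # Relevance is the dot product between the user vector and each candidate.
    scores = restaurant_embeddings[candidate_ids] @ user_vec
    order = np.argsort(-scores)  # descending by score
    return [candidate_ids[i] for i in order]

user_vec = user_tower(rng.normal(size=64).astype(np.float32))
ranked = rank_candidates(user_vec, candidate_ids=[3, 17, 42, 5])
```

Because the candidate set is already known (a tag-defined selection), the scoring step is just one small matrix–vector product per request.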
Uber’s write-up focused mainly on retrieval, but it also noted that the same architecture can serve as a final ranking layer when candidate generation is already handled elsewhere and latency must remain low. That second formulation was much closer to our use case.
Our Approach
We kept the two-tower structure but simplified the most resource-heavy parts. On the restaurant side, we didn’t fine-tune a language model inside the recommender. Instead, we reused a TinyBERT model that had already been fine-tuned for search in the app and treated it as a frozen semantic encoder. Its text embedding was combined with explicit restaurant features such as price, ratings, and recent performance signals, plus a small trainable restaurant ID embedding, and then projected into the final restaurant vector. This gave us semantic coverage without paying the full cost of end-to-end language-model training. For a POC or MVP, a small frozen sentence-transformer would be a reasonable starting point as well.
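A sketch of how those signals can be combined. Dimensions, table sizes, and the random projection are placeholders (TinyBERT produces 312-dim vectors, the rest is made up); in the real model the ID table and projection are trained while the text encoder stays frozen:

```python
import numpy as np

rng = np.random.default_rng(1)

TEXT_DIM, FEAT_DIM, ID_DIM, OUT_DIM = 312, 8, 16, 32
N_RESTAURANTS = 100

# Frozen: text embeddings from the search-tuned TinyBERT, precomputed per restaurant.
frozen_text_emb = rng.normal(size=(N_RESTAURANTS, TEXT_DIM)).astype(np.float32)
# Trainable in the real model; random here for illustration.
id_emb_table = rng.normal(size=(N_RESTAURANTS, ID_DIM)).astype(np.float32)
projection = rng.normal(size=(TEXT_DIM + FEAT_DIM + ID_DIM, OUT_DIM)).astype(np.float32)

def restaurant_tower(rid: int, tabular: np.ndarray) -> np.ndarray:
    # Concatenate semantic, explicit-feature, and ID signals, then project
    # into the shared latent space used for dot-product scoring.
    x = np.concatenate([frozen_text_emb[rid], tabular, id_emb_table[rid]])
    return x @ projection

vec = restaurant_tower(7, tabular=np.zeros(FEAT_DIM, dtype=np.float32))
```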
We avoided learning a dedicated user-ID embedding and instead represented each user on the fly through their previous interactions. The user vector was built from averaged embeddings of restaurants the customer had ordered from (Uber’s post mentions this source as well, but the authors don’t specify how it was used), along with user and session features. We also used views without orders as a weak negative signal. That mattered when order history was sparse or irrelevant to the current selection. If the model couldn’t clearly infer what the user liked, it still helped to know which restaurants had already been explored and rejected.
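One simple way to realize that idea is to average the ordered-restaurant embeddings and push the vector away from restaurants the user viewed but skipped. This is a sketch under our own assumptions: `neg_weight` is a hypothetical knob, not a tuned value from the system described above.

```python
import numpy as np

def user_vector(ordered_ids, viewed_only_ids, emb, neg_weight=0.2):
    # Positive signal: average embedding of restaurants the user ordered from.
    # Weak negative signal: restaurants viewed but never ordered from.
    dim = emb.shape[1]
    pos = emb[ordered_ids].mean(axis=0) if ordered_ids else np.zeros(dim)
    neg = emb[viewed_only_ids].mean(axis=0) if viewed_only_ids else np.zeros(dim)
    return pos - neg_weight * neg

emb = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 toy restaurants
v = user_vector(ordered_ids=[0, 1], viewed_only_ids=[2], emb=emb, neg_weight=0.5)
```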
An important modeling choice was filtering that history by the tag of the current selection. Averaging the whole order history created too much noise. If a customer mostly ordered burgers and then opened a dessert selection, a global average could pull the model toward burger places that happened to sell desserts rather than toward the strongest ice cream candidates. By filtering past interactions to matching tags before averaging, we made the user representation contextual instead of global. In practice, this was the difference between modeling long-term taste and modeling current intent.
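A sketch of the filtering step, assuming each past order carries its restaurant’s tags (the fallback to the full history when nothing matches is our assumption, not a detail from the text):

```python
import numpy as np

def contextual_user_vector(history, current_tag, emb):
    # history: list of (restaurant_id, tags) pairs from past orders.
    # Keep only orders whose restaurant shares the current selection's tag;
    # fall back to the full history if nothing matches (assumption here).
    matching = [rid for rid, tags in history if current_tag in tags]
    pool = matching or [rid for rid, _ in history]
    return emb[pool].mean(axis=0)

emb = np.arange(12, dtype=np.float32).reshape(4, 3)
history = [(0, {"burgers"}), (1, {"desserts"}), (2, {"desserts"})]
v = contextual_user_vector(history, "desserts", emb)  # averages rows 1 and 2
```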
Finally, we trained the model at the session level and used multi-task learning. The same restaurant can be positive in one session and negative in another, depending on the user’s current intent. The ranking head predicted click, add-to-basket, and order jointly, with a simple funnel constraint: P(order) ≤ P(add-to-basket) ≤ P(click). This made the model less static and improved ranking quality compared with optimizing a single objective in isolation.
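The funnel constraint can be enforced by construction rather than by a penalty: model each stage’s probability as the previous stage’s probability times a conditional sigmoid. This is a common parameterization; the original head may implement the constraint differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def funnel_head(logits):
    # Three raw logits from the ranking head, one per funnel stage.
    l_click, l_atb, l_order = logits
    p_click = sigmoid(l_click)
    p_atb = p_click * sigmoid(l_atb)    # P(add-to-basket) = P(click) * P(atb | click)
    p_order = p_atb * sigmoid(l_order)  # P(order) = P(atb) * P(order | atb)
    return p_click, p_atb, p_order

p_click, p_atb, p_order = funnel_head([0.3, -0.5, 1.2])
```

Since each factor lies in (0, 1), P(order) ≤ P(add-to-basket) ≤ P(click) holds for any logits, so no extra loss term is needed to keep the scores consistent.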
Offline validation was also stricter than a random split: evaluation used out-of-time data and users unseen during training, which made the setup closer to production behavior.
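A sketch of that split, assuming interaction logs with `ts` (timestamp) and `user_id` columns; the column names and the pandas representation are illustrative:

```python
import pandas as pd

def strict_split(df, cutoff, holdout_users):
    # Train: interactions before the cutoff from non-holdout users.
    # Test: interactions at/after the cutoff from holdout users only,
    # so evaluation is both out-of-time and on unseen users.
    train = df[(df["ts"] < cutoff) & (~df["user_id"].isin(holdout_users))]
    test = df[(df["ts"] >= cutoff) & (df["user_id"].isin(holdout_users))]
    return train, test

logs = pd.DataFrame({
    "ts": [1, 2, 6, 7],
    "user_id": [10, 20, 10, 20],
})
train, test = strict_split(logs, cutoff=5, holdout_users={20})
```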
Results
According to A/B tests, the final system showed a statistically significant uplift in conversion rate. Just as importantly, it was not tied to one widget. Because the model scores a user–restaurant pair rather than a fixed list, it generalized naturally to new selections without architectural changes, since tags are part of a restaurant’s metadata and can be retrieved without any particular selection in mind.
That transferability made the model useful beyond the original ranking surface. We later reused it in Ads, where its CTR-oriented output was applied to individual promoted restaurants with positive results. The same representation-learning setup therefore worked both for selection ranking and for other recommendation-like placement problems inside the app.
Further Research
The most obvious next step is multimodality. Restaurant images, icons, and potentially menu visuals can be added as extra branches to the restaurant tower. That matters because click behavior is strongly influenced by presentation. A pizza place inside a pizza selection may underperform if its primary image doesn’t show pizza, while a budget restaurant can look premium purely because of its hero image. Text and tabular features don’t capture that gap well.
Key Takeaways:
- Two-tower models can work even with limited data. You don’t need Uber-scale infrastructure if candidate retrieval is already solved and the model focuses only on the ranking stage.
- Reuse pretrained embeddings instead of training from scratch. A frozen lightweight language model (e.g., TinyBERT or a small sentence-transformer) can provide strong semantic signals without expensive fine-tuning.
- Averaging embeddings of previously ordered restaurants works surprisingly well when user history is sparse.
- Contextual filtering reduces noise and helps the model capture the user’s current intent, not only long-term taste.
- Negative signals help in sparse environments. Restaurants that users viewed but didn’t order from provide useful information when positive signals are limited.
- Multi-task learning stabilizes ranking. Predicting click, add-to-basket, and order jointly with funnel constraints produces more consistent scores.
- Design for reuse. A model that scores user–restaurant pairs rather than specific lists can be reused across product surfaces such as selections, search ranking, or ads.
