Building Airbnb Categories with ML & Human in the Loop

Airbnb Categories Blog Series — Part II: ML Categorization


In 2022, Airbnb introduced Categories, a browse-focused product that enables users to seek inspiration by browsing collections of homes revolving around a common theme, such as Lakefront, Countryside, Golf, Desert, National Parks, Surfing, etc. In Part I of our Categories Blog Series we covered the high-level approach to creating Categories and showcasing them in the product. In this Part II we describe the ML Categorization work in more detail.

Throughout the post we use the Lakefront category as a running example to showcase the ML-powered category development process. A similar process was applied to other categories, with category-specific nuances. For instance, some categories rely more on points of interest, while others rely more on structured listing signals, image data, etc.

Category development starts with a product-driven category definition: "The Lakefront category should include listings that are less than 100 meters from a lake." While this may sound like a simple task at first, it is very delicate and complex, as it involves leveraging multiple structured and unstructured listing attributes, points of interest (POIs), etc. It also involves training ML models that combine them, since none of the signals captures the entire space of possible candidates on its own.

Listing Understanding Signals

As part of various past projects, multiple teams at Airbnb spent time processing different types of raw data to extract useful information in structured form. Our goal was to leverage these signals for cold-start rule-based category candidate generation, and later use them as features of the ML models that would find category candidates with higher precision:

  • Listing attributes, such as property type (e.g. castle, houseboat) and amenities (pool, fire pit, forest view, etc.), as well as host-provided text that can be scanned for keywords (we gathered exhaustive sets of keywords in multiple languages per category).
  • Host guidebooks, where hosts recommend nearby places for guests to visit (e.g. a vineyard, surf beach, golf course), which hold location data that was useful for extracting POIs.
  • Host-offered experiences, such as Surfing, Golfing, Scuba, etc., which proved useful in identifying listing candidates for certain activity-related categories.
  • Guest reviews, another source that can be scanned for keywords. We also collect supplemental guest reviews where guests provide additional feedback.
  • Wishlists that guests create when browsing, such as “Golf trip 2022”, “Beachfront”, “Yosemite trip”, which are often related to one of the categories and proved useful for candidate generation.
Figure 1. Popular wishlists created by Airbnb users

The listing understanding knowledge base was further enriched using external data, such as water-body data (telling us if a listing is near an ocean, river, or lake), area-type data (telling us if a listing is in a rural, urban, or metropolitan area), and a POI database that contains names and locations of places of interest from host guidebooks or collected by us via open-source datasets and further improved, enriched, and adjusted through in-house human review.

Finally, we leveraged our in-house ML models for additional knowledge extraction from raw listing data. These included computer vision models that detect amenities and scenes in listing photos, listing embeddings, and image quality models. Each of these was useful in a different stage of category development: candidate generation, expansion, and quality prediction, respectively.

Once a category is defined, we first leverage the pre-computed listing understanding signals and ML model outputs described in the previous section to codify the definition with a set of rules. Our candidate generation engine then applies them to produce a set of rule-based candidates and prioritizes them for human review based on a category confidence score.

This confidence score is computed based on how many signals qualified the listing for the category and the weights associated with each rule. For example, in the Lakefront category, proximity to a Lake POI carried the most weight, host-provided signals on direct lake access were next most important, lakefront keywords found in the listing title, description, wishlists, and reviews carried less weight, while lake and water detection in listing images carried the least weight. A listing that had all of these attributes would have a very high confidence score, while a listing that had only one would have a lower score.
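
As a concrete illustration, here is a minimal sketch of how such a weighted confidence score could be computed. The signal names and weights are hypothetical, chosen only to mirror the relative importance described above, not Airbnb's actual values.

```python
# Hypothetical rule weights for the Lakefront category (illustrative values only).
LAKEFRONT_RULE_WEIGHTS = {
    "near_lake_poi": 0.50,           # within a short distance of a Lake POI
    "host_direct_lake_access": 0.25, # host-provided direct lake access signal
    "lakefront_keywords": 0.15,      # title, description, wishlists, reviews
    "lake_in_images": 0.10,          # lake/water detected in listing photos
}

def category_confidence(listing_signals: dict) -> float:
    """Sum the weights of the rules a listing satisfies.

    `listing_signals` maps rule name -> bool (did the rule fire for this listing).
    """
    return sum(
        weight
        for rule, weight in LAKEFRONT_RULE_WEIGHTS.items()
        if listing_signals.get(rule, False)
    )

# A listing matching every signal gets the maximum score of 1.0,
# while one that only matches keywords scores 0.15.
print(category_confidence({"near_lake_poi": True, "host_direct_lake_access": True,
                           "lakefront_keywords": True, "lake_in_images": True}))  # 1.0
print(category_confidence({"lakefront_keywords": True}))                          # 0.15
```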

Candidates were sent for human review daily, by selecting a certain number of listings with the highest category confidence score from each category. Human agents then judged whether the listing belongs to the category, selected the best cover photo, and assessed the quality of the listing (Figure 3).

As human reviews began rolling in and there were enough listings with confirmed and rejected category tags, it unlocked new candidate generation techniques that began contributing their own candidates:

  • Proximity-based candidates: leveraging distance to a confirmed listing in a given category, e.g. the neighbor of a confirmed Lakefront listing is likely also Lakefront.
  • Embedding similarity candidates: leveraging listing embeddings to find listings that are most similar to confirmed listings in a given category (see the sketch after this list).
  • Model-based candidates: once agents had reviewed 20% of the rule-based candidates, we began training ML models.
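
A minimal sketch of the embedding-similarity expansion mentioned above, assuming listing embeddings are already computed and L2-normalized. This is a plain numpy nearest-neighbor illustration, not the actual production retrieval system.

```python
import numpy as np

def expand_candidates(confirmed_ids, embeddings, ids, top_k=100):
    """Find listings most similar to confirmed category members.

    embeddings: (N, d) array of L2-normalized listing embeddings
    ids:        list of N listing ids aligned with `embeddings`
    """
    id_to_idx = {lid: i for i, lid in enumerate(ids)}
    confirmed = embeddings[[id_to_idx[c] for c in confirmed_ids]]
    # Cosine similarity of every listing to its most similar confirmed listing.
    sims = embeddings @ confirmed.T          # (N, num_confirmed)
    best_sim = sims.max(axis=1)
    order = np.argsort(-best_sim)
    confirmed_set = set(confirmed_ids)
    return [(ids[i], float(best_sim[i]))
            for i in order[:top_k] if ids[i] not in confirmed_set]
```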

In the beginning, only agent-vetted listings were sent to production and featured on the homepage. Over time, as our candidate generation techniques produced more candidates and the feedback loop repeated, it allowed us to train better and better ML models with more labeled data. Finally, at one point, when the ML models were good enough, we began sending listings with high enough model scores to production (Figure 2).

Figure 2. Number of listings in production per category and fraction vetted by humans

Aligning ML Models with Human Review Tasks

In order to scale the review process, we trained ML models that mimic each of the three human agent tasks (Figure 3). In the following sections we describe the training and evaluation process for each model.

Figure 3. ML model setup for mimicking human review

ML Categorization Model

The ML Categorization Model's task was to confidently place listings in a category. These models were trained using Bighead (Airbnb's ML platform) as XGBoost binary per-category classification models. They used agent category assignments as labels and the signals described in the Listing Understanding section as features. Compared to a rule-based setting, ML models gave us better control of candidate precision via a model score threshold.

Although many features are shared across categories and one could train a single multiclass model, due to the high imbalance in category sizes and the dominance of category-specific features we found it better to train dedicated per-category ML models. Another big reason for this was that a major change to a single category, such as a change in definition or a large addition of new POIs or labels, did not require us to retrain, launch, and measure impact on all categories; instead we could conveniently work on a single category in isolation.
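
A minimal sketch of what one such per-category binary classifier could look like, using plain `xgboost` with made-up feature names. The actual models were trained on Bighead; the features and hyperparameters here are assumptions for illustration only.

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

# Hypothetical feature columns; labels come from agent decisions (1 = confirmed, 0 = rejected).
FEATURES = ["dist_to_lake_poi_m", "host_lake_access", "keyword_hits",
            "wishlist_hits", "water_in_images_score"]

def train_category_model(df: pd.DataFrame) -> xgb.XGBClassifier:
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["label"], test_size=0.3, random_state=42)
    model = xgb.XGBClassifier(
        n_estimators=200, max_depth=6, learning_rate=0.1, eval_metric="aucpr")
    model.fit(X_train, y_train)
    # Report Average Precision on the held-out split.
    ap = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Average Precision: {ap:.3f}")
    return model
```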

Features: the first step was to construct features, with the most important one being distance to a Lake POI. We started by collecting Lake POIs represented as a single point and later added lake boundaries that trace the lake, which greatly improved our accuracy in pulling in listings near the boundary. Nevertheless, as shown in Figure 4, even then there were many edge cases that led to mistakes in rule-based listing assignment.

Figure 4. Examples of an imperfect POI (left) and complex geography: highway between lake and home (middle), long backyards (right)

These include imperfect lake boundaries that may fall inside the water or outside on land, highways between the lake and homes, homes on cliffs, imperfect listing locations, missing POIs, and POIs that are not actual lakes, such as reservoirs, ponds, etc. For that reason, it proved helpful to combine POI data with other listing signals as ML model features, and then use the model to proactively improve the Lake POI database.

One modeling maneuver that proved useful here was feature dropout. Since most of the features were also used for generating the rule-based candidates graded by agents, producing the labels used by the ML model, there was a risk of overfitting and of limited pattern discovery beyond the rules.

To address this problem, during training we would randomly drop some feature signals, such as distance from a Lake POI, from some listings. As a result, the model did not over-rely on the dominant POI feature, which allowed listings to have a high ML score even if they are not near any known Lake POI. This allowed us to find missing POIs and add them to our database.
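
A minimal sketch of the feature-dropout idea, assuming a pandas feature matrix. The dropout rate and column name are illustrative, not the values used in production.

```python
import numpy as np
import pandas as pd

def drop_features(X: pd.DataFrame, cols=("dist_to_lake_poi_m",), rate=0.2, seed=0):
    """Randomly blank out dominant features for a fraction of training rows.

    This forces the model to learn from the remaining signals, so listings far
    from any known Lake POI can still receive a high score.
    """
    rng = np.random.default_rng(seed)
    X = X.copy()
    for col in cols:
        mask = rng.random(len(X)) < rate
        X.loc[mask, col] = np.nan   # XGBoost treats NaN as a missing value
    return X
```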

Labels: positive labels were assigned to listings that agents tagged as Lakefront; negative labels were assigned to listings sent for review as Lakefront candidates but rejected (hard negatives, from a modeling perspective). We also sampled negatives from the related Lake House category, which allows a greater distance to the lake, and from listings tagged in other categories.

Train/test split: a 70:30 random split, with special handling of the distance and embedding similarity features so as not to leak the label.

Figure 5. Lakefront ML model feature importance and performance evaluation

We trained several models using different feature subsets. We were interested in how well POI data could do on its own and what improvements additional signals could provide. As can be observed in Figure 5, POI distance is the most important feature by far. Nevertheless, when used on its own it cannot approach the full ML model's performance. Specifically, the ML model improves Average Precision by 23%, from 0.74 to 0.91, which confirmed our hypothesis.

Since the POI feature is the most important feature, we invested in improving it by adding new POIs and refining existing ones. This proved to be helpful, as the ML model using the improved POI features greatly outperforms the model that used the initial POI features (Figure 5).

The process of Lake POI refinement included leveraging the trained ML model to find missing POIs, by inspecting listings that have a high model score but are far from existing Lake POIs (Figure 6, left), and to find incorrect POIs, by inspecting listings that have a low model score but are very close to an existing Lake POI (Figure 6, right).

Figure 6. Process of finding missing POIs (left) and incorrect POIs (right)
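
A sketch of the two refinement queries described above, assuming a dataframe that carries each listing's model score and distance to the nearest known Lake POI. The column names and thresholds are illustrative.

```python
import pandas as pd

def poi_refinement_queues(df: pd.DataFrame, score_hi=0.9, score_lo=0.2,
                          dist_far_m=500, dist_near_m=50):
    """Surface listings whose model score disagrees with the POI database.

    df columns: listing_id, model_score, dist_to_lake_poi_m
    """
    # High score but far from any known lake -> a Lake POI may be missing.
    missing_poi = df[(df.model_score >= score_hi) & (df.dist_to_lake_poi_m >= dist_far_m)]
    # Low score but right next to a "lake" POI -> the POI may be wrong
    # (pond, reservoir, bad boundary, etc.).
    wrong_poi = df[(df.model_score <= score_lo) & (df.dist_to_lake_poi_m <= dist_near_m)]
    return missing_poi, wrong_poi
```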

Setting the threshold: using the test-set Precision-Recall curve, we found a threshold that achieves 90% Precision. We used this threshold to decide which candidates can go directly to production and which need to be sent for human review first.
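
A minimal sketch of picking such a score threshold from the test-set precision-recall curve with scikit-learn; the target value follows the 90% precision mentioned above.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_score, target_precision=0.90):
    """Return the lowest score threshold whose test-set precision reaches the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the final point to align.
    ok = np.where(precision[:-1] >= target_precision)[0]
    if len(ok) == 0:
        raise ValueError("Target precision not reachable on this test set")
    i = ok[0]   # lowest qualifying threshold keeps recall as high as possible
    return thresholds[i], precision[i], recall[i]
```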

Cover Photo Selection Model

To perform the second agent task with ML, we needed to train a different type of ML model, one whose task would be to choose the most appropriate listing cover photo given the category context. For example, selecting a listing photo with a lake view for the Lakefront category.

We tested several out-of-the-box object detection models as well as several in-house solutions trained using human review data, i.e. (listing id, category, cover photo id) tuples. We found that the best cover photo selection accuracy was achieved by fine-tuning a Vision Transformer (VT) using our human review data. Once trained, the model can score all listing photos and choose which one is the best cover photo for a given category.

To evaluate the model we used a hold-out dataset and tested whether the agent-chosen listing photo for a particular category was within the top 3 highest-scoring VT model photos for the same category. The average Top-3 precision across all categories was 70%, which we found satisfactory.
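
A sketch of that Top-3 evaluation, assuming a `score_photo(photo, category)` helper backed by the fine-tuned model. Both the helper and the row format are hypothetical, used only to make the metric concrete.

```python
def top3_precision(eval_rows, score_photo):
    """Fraction of listings where the agent's pick lands in the model's top 3.

    eval_rows: iterable of (photos, category, agent_choice), where `photos` is a
    list of photo ids and `agent_choice` is the photo the agent selected.
    """
    hits, total = 0, 0
    for photos, category, agent_choice in eval_rows:
        ranked = sorted(photos, key=lambda p: score_photo(p, category), reverse=True)
        hits += agent_choice in ranked[:3]
        total += 1
    return hits / total if total else 0.0
```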

To further test the model, we judged whether the VT-chosen photo represented the category better than the Host-chosen cover photo (Figure 7). It was found that the VT model selects a better photo in 77% of cases. It should be noted that the Host-chosen cover photo is typically chosen without any category in mind, as the one that best represents the listing in the search feed.

Figure 7. Vision Transformer vs. Host-chosen cover photo selection for the same listing in the Lakefront category

In addition to choosing the best cover photo for candidates that are sent to production by the ML categorization model, the VT model was also used to speed up the human review process. By ordering the candidate listing photos in descending order of VT score, we were able to reduce the time it takes agents to decide on a category and cover photo by 18%.

Finally, for some highly visual categories, such as Design and Creative Spaces, the VT model proved useful for direct candidate generation.

Quality ML Model

The final human review task is to judge the quality of the listing by choosing one of four tiers: Most Inspiring, High Quality, Acceptable, Low Quality. As we will discuss in Part III of the blog series, quality plays a role in the ranking of listings in the search feed.

To train an ML model that can predict the quality of a listing, we used a combination of engagement, quality, and visual signals to create a feature set, and agent quality tags to create labels. The features included review ratings, wishlists, image quality, embedding signals, and listing amenities and attributes, such as price, number of guests, etc.

Given the multi-class setup with four quality tiers, we experimented with different loss functions (pairwise loss, one-vs-all, one-vs-one, multi-label, etc.). We then compared the ROC curves of the different strategies on a hold-out set, and the binary one-vs-all models performed the best.
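
A minimal sketch of the one-vs-all setup that performed best, using scikit-learn. The base estimator and feature handling are illustrative assumptions; only the four quality tiers and the ROC comparison come from the text above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

TIERS = ["Most Inspiring", "High Quality", "Acceptable", "Low Quality"]

def train_quality_model(X_train, y_train, X_test, y_test):
    """Train one binary classifier per quality tier and report per-tier ROC AUC."""
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)                    # (n_samples, n_tiers)
    y_bin = label_binarize(y_test, classes=clf.classes_)  # one column per tier
    for i, tier in enumerate(clf.classes_):
        print(tier, roc_auc_score(y_bin[:, i], scores[:, i]))
    return clf
```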

Figure 8: Quality ML model feature importance and ROC curve

In addition to playing a role in search ranking, the Quality ML score also played a role in the human review prioritization logic. With ML models functional for all three human review tasks, we could now streamline the review process and send more candidates directly to production, while also prioritizing some for human review. This prioritization plays an important role in the system because listings that are vetted by humans may rank higher in the category feed.

There were several factors to consider when prioritizing listings for human review, including listing category confidence score, listing quality, bookability, and popularity of the region. The best strategy proved to be a combination of these factors. In Figure 9 we show the top candidates for human review for several categories at the time of writing this post.
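
A sketch of one way to combine these factors into a single prioritization score. The weights and field names are hypothetical; the source only states that a combination of the factors worked best.

```python
def review_priority(listing: dict, w_conf=0.4, w_quality=0.3,
                    w_bookable=0.2, w_region=0.1) -> float:
    """Combine review-prioritization factors into a single score in [0, 1].

    `listing` is expected to carry each factor already normalized to [0, 1].
    """
    return (w_conf     * listing["category_confidence"]
          + w_quality  * listing["quality_score"]
          + w_bookable * listing["bookability_score"]
          + w_region   * listing["region_popularity"])

# Review queue: highest priority first.
# queue = sorted(candidates, key=review_priority, reverse=True)
```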

Figure 9: Listings prioritized for review in 4 different categories

Once graded, those labels are then used for periodic model re-training in an active feedback loop that continuously improves category accuracy and coverage.

Future Work

Our future work involves iterating on the three ML models in several directions, including generating a larger set of labels using generative vision models and potentially combining them into a single multi-task model. We are also exploring ways of using Large Language Models (LLMs) for conducting category review tasks.

If this type of work interests you, check out some of our related open roles!
