By Guru Tahasildar, Amir Ziai, Jonathan Solórzano-Hamilton, Kelli Griggs, Vi Iyengar
Netflix leverages machine learning to create the best media for our members. Earlier we shared the details of one of these algorithms, introduced how our platform team is evolving the media-specific machine learning ecosystem, and discussed how data from these algorithms gets stored in our annotation service.
Much of the ML literature focuses on model training, evaluation, and scoring. In this post, we will explore an understudied aspect of the ML lifecycle: integration of model outputs into applications.
Specifically, we will dive into the architecture that powers search capabilities for studio applications at Netflix. We discuss specific problems that we have solved using Machine Learning (ML) algorithms, review different pain points that we addressed, and provide a technical overview of our new platform.
At Netflix, we aim to bring joy to our members by providing them with the opportunity to experience outstanding content. There are two components to this experience. First, we must provide the content that will bring them joy. Second, we must make it effortless and intuitive to choose from our library. We must quickly surface the most stand-out highlights from the titles available on our service in the form of images and videos in the member experience.
Here is an example of such an asset created for one of our titles:
These multimedia assets, or “supplemental” assets, don’t just come into existence. Artists and video editors must create them. We build creator tooling to enable these colleagues to focus their time and energy on creativity. Unfortunately, much of their energy goes into labor-intensive pre-work. A key opportunity is to automate these mundane tasks.
Use case #1: Dialogue search
Dialogue is a central aspect of storytelling. One of the most effective ways to tell an engaging story is through the mouths of the characters. Punchy or memorable lines are a prime target for trailer editors. The manual method for identifying such lines is a watchdown (aka breakdown).
An editor watches the title start-to-finish, transcribes memorable words and phrases with a timecode, and retrieves the snippet later if the quote is needed. An editor can choose to do this quickly and only jot down the most memorable moments, but will have to rewatch the content if they missed something they need later. Or, they can do it thoroughly and transcribe the entire piece of content ahead of time. In the words of one of our editors:
Watchdowns / breakdowns are very repetitive and waste countless hours of creative time!
Scrubbing through hours of footage (or dozens of hours if working on a series) to find a single line of dialogue is profoundly tedious. In some cases editors need to search across many shows, and doing this manually is simply not feasible. But what if scrubbing and transcribing dialogue were not needed at all?
Ideally, we want to enable dialogue search that supports the following features:
- Search across one title, a subset of titles (e.g. all dramas), or the entire catalog
- Search by character or talent
- Multilingual search
Use case #2: Visual search
A picture is worth a thousand words. Visual storytelling can help make complex stories easier to understand, and as a result, deliver a more impactful message.
Artists and video editors routinely need specific visual elements to include in artworks and trailers. They may scrub for frames, shots, or scenes of specific characters, locations, objects, events (e.g. a car chase scene in an action movie), or attributes (e.g. a close-up shot). What if we could enable users to find visual elements using natural language?
Here is an example of the desired output when the user searches for “red race car” across the entire content library.
Use case #3: Reverse shot search
Natural-language visual search offers editors a powerful tool. But what if they already have a shot in mind, and they want to find something that just looks similar? For instance, let’s say that an editor has found a visually stunning shot of a plate of food from Chef’s Table, and they are interested in finding similar shots across the entire show.
Approach #1: on-demand batch processing
Our first approach to surface these innovations was a tool to trigger these algorithms on-demand and on a per-show basis. We implemented a batch processing system for users to submit their requests and wait for the system to generate the output. Processing took several hours to complete. Some ML algorithms are computationally intensive. Many of the samples provided had a significant number of frames to process. A typical 1-hour video could contain over 80,000 frames!
After waiting for processing, users downloaded the generated algo outputs for offline consumption. This limited pilot system greatly reduced the time spent by our users to manually analyze the content. Here’s a visualization of this flow.
Approach #2: enabling online request with pre-computation
After the success of this approach we decided to add online support for a couple of algorithms. For the first time, users were able to discover matches across the entire catalog, oftentimes finding moments they never knew even existed. They didn’t need any time-consuming local setup, and there were no delays since the data was already pre-computed.
The following quote exemplifies the positive reception by our users:
“We wanted to find all of the shots of the dining room in a show. In seconds, we had what normally would have taken 1–2 people hours/a full day to do, glance through all of the shots of the dining room from all 10 episodes of the show. Incredible!”
Dawn Chenette, Design Lead
This approach had several advantages for product engineering. It allowed us to transparently update the algo data without users knowing about it. It also provided insights into query patterns and algorithms that were gaining traction among users. In addition, we were able to perform a handful of A/B tests to validate or negate our hypotheses for tuning the search experience.
Our early efforts to deliver ML insights to creative professionals proved valuable. At the same time we experienced growing engineering pains that limited our ability to scale.
Maintaining disparate systems posed a challenge. They were first built by different teams on different stacks, so maintenance was expensive. Whenever ML researchers finished a new algorithm, they had to integrate it separately into each system. We were near the breaking point with just two systems and a handful of algorithms. We knew this would only worsen as we expanded to more use cases and more researchers.
The online application unlocked interactivity for our users and validated our direction. However, it was not scaling well. Adding new algos and onboarding new use cases was still time consuming and required the effort of too many engineers. These investments in one-to-one integrations were volatile, with implementation timelines varying from a few weeks to several months. Due to the bespoke nature of the implementation, we lacked catalog-wide searches for all available ML sources.
In summary, this model was a tightly-coupled application-to-data architecture, where machine learning algos were mixed with the backend and UI/UX software code stack. To address the variance in implementation timelines, we wanted to standardize how different algorithms were integrated, from how they are executed to making the data available to all consumers consistently. As we developed more media understanding algos and wanted to expand to additional use cases, we needed to invest in a system architecture redesign to enable researchers and engineers from different teams to innovate independently and collaboratively. Media Search Platform (MSP) is the initiative to address these requirements.
Although we were just getting started with media search, search itself is not new to Netflix. We have mature and robust search and recommendation functionality exposed to millions of our subscribers. We knew we could leverage learnings from our colleagues who are responsible for building and innovating in this space. In keeping with our “highly aligned, loosely coupled” culture, we wanted to enable engineers to onboard and improve algos quickly and independently, while making it easy for Studio and product applications to integrate with the media understanding algo capabilities.
Making the platform modular, pluggable, and configurable was key to our success. This approach allowed us to keep ownership of the platform distributed. It simultaneously allowed different specialized teams to contribute relevant components of the platform. We used services already available for other use cases and extended their capabilities to support new requirements.
Next we will discuss the system architecture and describe how different modules interact with one another for the end-to-end flow.
Netflix engineers strive to iterate rapidly and prefer the “MVP” (minimum viable product) approach to receive early feedback and minimize upfront investment costs. Thus, we didn’t build all of the modules completely. We scoped the pilot implementation to ensure immediate functionalities were unblocked. At the same time, we kept the design open enough to allow future extensibility. We will highlight a few examples below as we discuss each component individually.
Interfaces – API & Query
Starting at the top of the diagram, the platform allows apps to interact with it using either gRPC or GraphQL interfaces. Having diversity in the interfaces is important to meet the app developers where they are. At Netflix, gRPC is predominantly used in backend-to-backend communication. With strong GraphQL tooling provided by our developer productivity teams, GraphQL has become a de facto choice for UI-backend integration. You can find more about what the team has built and how it is being used in these blog posts. Specifically, we have been relying on the Domain Graph Service Framework for this project.
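To make this concrete, below is a minimal sketch of what a client-side call against the GraphQL interface could look like. The endpoint URL, operation, and field names (`mediaSearch`, `titleId`, and so on) are hypothetical placeholders, not the platform’s actual schema.

```python
import requests

# Hypothetical GraphQL search operation; the fields shown are illustrative
# only and do not reflect the platform's real schema.
SEARCH_QUERY = """
query MediaSearch($text: String!, $titleIds: [ID!], $first: Int!) {
  mediaSearch(text: $text, titleIds: $titleIds, first: $first) {
    edges {
      titleId
      startTimestampMs
      endTimestampMs
      score
    }
  }
}
"""

def search_dialogue(text, title_ids=None, first=25):
    """Send a search request to a placeholder GraphQL endpoint."""
    response = requests.post(
        "https://msp.example.internal/graphql",  # placeholder URL
        json={
            "query": SEARCH_QUERY,
            "variables": {"text": text, "titleIds": title_ids, "first": first},
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["data"]["mediaSearch"]["edges"]
```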
During the design of the query schema, we accounted for future use cases and ensured that the schema will allow future extensions. We aimed to keep the schema generic enough that it hides the implementation details of the actual search systems used to execute the query. Additionally, it is intuitive and easy to understand, yet feature-rich so that it can be used to express complex queries. Users have the flexibility to perform multimodal search, with the input being a simple text term, an image, or a short video. As discussed earlier, search can be performed against the entire Netflix catalog, or it can be limited to specific titles. Users may prefer results that are organized in some way, such as grouped by movie and sorted by timestamp. When there are a large number of matches, we allow users to paginate the results (with a configurable page size) instead of fetching all or a fixed number of results.
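As an illustration of this generality, here is a sketch of how such a query model might be represented. The type and field names are our own invention for exposition, not the actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class InputKind(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"

class GroupBy(Enum):
    NONE = "none"
    TITLE = "title"

@dataclass
class SearchScope:
    # An empty title_ids list means "search the entire catalog".
    title_ids: list[str] = field(default_factory=list)

@dataclass
class SearchQuery:
    kind: InputKind
    text: Optional[str] = None          # used when kind == TEXT
    media_uri: Optional[str] = None     # used when kind is IMAGE or VIDEO
    scope: SearchScope = field(default_factory=SearchScope)
    group_by: GroupBy = GroupBy.TITLE
    sort_by_timestamp: bool = True
    page_size: int = 25                 # configurable pagination
    page_token: Optional[str] = None    # opaque cursor for the next page
```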
Search Gateway
The client-generated input query is first given to the query processing system. Since most of our users are performing targeted queries, such as searching for the dialogue “friends don’t lie” (from the example above), today this stage performs lightweight processing and provides a hook to integrate A/B testing. In the future we plan to evolve it into a “query understanding system” to support free-form searches, reducing the burden on users and simplifying client-side query generation.
Query processing modifies queries to match the target data set. This includes “embedding” transformation and translation. For queries against embedding-based data sources, it transforms the input, such as text or an image, into the corresponding vector representation. Each data source or algorithm may use a different encoding technique, so this stage ensures that the corresponding encoding is also applied to the provided query. One example of why we need different encoding techniques per algorithm is that an image, which has a single frame, is processed differently from a video, which contains a sequence of multiple frames.
With global expansion, we have users for whom English is not the primary language. All of the text-based models in the platform are trained using the English language, so we translate non-English text to English. Although the translation is not always perfect, it has worked well in our case and has expanded the eligible user base for our tool to non-English speakers.
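Putting the last two steps together, here is a minimal sketch of the text path through query processing. The language-detection, translation, and encoder components are hypothetical stand-ins supplied by the caller, not the platform’s actual services.

```python
from typing import Callable, Protocol, Sequence

class TextEncoder(Protocol):
    """Algorithm-specific text encoder producing an embedding vector."""
    def encode(self, text: str) -> Sequence[float]: ...

def prepare_text_query(
    text: str,
    encoder: TextEncoder,
    detect_language: Callable[[str], str],
    translate_to_english: Callable[[str], str],
) -> Sequence[float]:
    """Translate non-English input, then apply the algorithm's own encoding."""
    if detect_language(text) != "en":
        # Text-based models are trained on English, so translate first.
        text = translate_to_english(text)
    return encoder.encode(text)
```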
Once the query is transformed and ready for execution, we delegate search execution to one or more of the searcher systems. First we need to determine which query should be routed to which system. This is handled by the query router and searcher-proxy module. For the initial implementation we relied on a single searcher for executing all of the queries. Our extensible approach meant the platform could support additional searchers, which have already been used to prototype new algorithms and experiments.
A search may intersect or aggregate the data from multiple algorithms, so this layer can fan out a single query into multiple search executions. We have implemented a “searcher-proxy” inside this layer for each supported searcher. Each proxy is responsible for mapping the input query to the one expected by the corresponding searcher. It then consumes the raw response from the searcher before handing it over to the results post-processor component.
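A minimal sketch of this fan-out, assuming a hypothetical `SearcherProxy` interface per searcher; real routing logic would also consult the query’s target algorithms and data sets.

```python
import asyncio
from typing import Any, Protocol

class SearcherProxy(Protocol):
    """Maps the platform query to a specific searcher's request format."""
    async def search(self, query: dict[str, Any]) -> list[dict[str, Any]]: ...

class QueryRouter:
    """Fans a single logical query out to the searchers hosting the target algos."""

    def __init__(self, proxies: dict[str, SearcherProxy]) -> None:
        # Maps an algorithm id to the proxy for the searcher that hosts it.
        self._proxies = proxies

    async def execute(
        self, query: dict[str, Any], algo_ids: list[str]
    ) -> list[dict[str, Any]]:
        # One search execution per target searcher, run concurrently.
        batches = await asyncio.gather(
            *(self._proxies[algo_id].search(query) for algo_id in algo_ids)
        )
        # Flatten; ranking and merging happen in the results post-processor.
        return [hit for batch in batches for hit in batch]
```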
The results post-processor works on the results returned by one or more searchers. It can rank results by applying custom scoring, or populate search recommendations based on other similar searches. Another functionality we are evaluating with this layer is dynamically creating different views from the same underlying data.
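For illustration, a custom-scoring pass might look like the sketch below. The specific blend of raw score and a popularity boost is hypothetical, not the platform’s actual scoring.

```python
def rerank(hits: list[dict], boost_weight: float = 0.1) -> list[dict]:
    """Blend the searcher's raw score with a simple custom boost.

    Each hit is a dict carrying a raw `score` plus metadata; the blend
    below is purely illustrative.
    """
    for hit in hits:
        hit["final_score"] = hit["score"] + boost_weight * hit.get("popularity", 0.0)
    return sorted(hits, key=lambda h: h["final_score"], reverse=True)
```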
For ease of coordination and maintenance, we abstracted the query processing and response handling in a module called the Search Gateway.
Searchers
As mentioned above, query execution is handled by the searcher system. The primary searcher used in the current implementation is Marken, a scalable annotation service built at Netflix. It supports different categories of searches, including full-text and embedding-vector-based similarity searches. It can store and retrieve temporal (timestamp) as well as spatial (coordinates) data. The service leverages Cassandra and Elasticsearch for data storage and retrieval. When onboarding embedding vector data, we performed extensive benchmarking to evaluate the available datastores. One takeaway here is that even when there is a datastore that specializes in a particular query pattern, for ease of maintainability and consistency we decided not to introduce it.
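Marken’s internals are not covered here, but as a generic illustration of an embedding-vector similarity search against an Elasticsearch 8.x index, a query could look like the following. The cluster address, index, and field names are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def vector_search(query_vector: list[float], index: str = "media-annotations", k: int = 10):
    """Approximate nearest-neighbor search over stored embedding vectors.

    The index and field names are illustrative; Marken's actual schema is internal.
    """
    return es.search(
        index=index,
        knn={
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 100,
        },
        source=["title_id", "start_ms", "end_ms"],  # temporal metadata per match
    )
```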
We have identified a handful of common schema types and standardized how data from different algorithms is stored. Each algorithm still has the flexibility to define a custom schema type. We are actively innovating in this space, and recently added the capability to intersect data from different algorithms. This is going to unlock creative ways in which data from multiple algorithms can be superimposed on one another to quickly get to the desired results.
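For temporal data, intersecting two algorithms’ results reduces to overlapping their time ranges, e.g. spans where a location match from one algorithm overlaps a character match from another. A minimal sketch, assuming both inputs are sorted by start time:

```python
def intersect_intervals(a: list[tuple[int, int]], b: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Intersect (start_ms, end_ms) ranges from two algorithms for the same title."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))  # overlapping span found
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```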
Algo Execution & Ingestion
So far we have focused on how the data is queried, but there is equally complex machinery powering algorithm execution and the generation of the data. This is handled by our dedicated media ML Platform team. The team focuses on building a suite of media-specific machine learning tooling. It facilitates seamless access to media assets (audio, video, image, and text) along with media-centric feature storage and compute orchestration.
For this project we developed a custom sink that indexes the generated data into Marken according to predefined schemas. Special care is taken when the data is backfilled for the first time, so as to avoid overwhelming the system with a huge volume of writes.
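A sketch of how such a rate-limited backfill might look; the batch size, rate, and the `write_batch` bulk-indexing call are placeholder assumptions, not the sink’s actual implementation.

```python
import time
from typing import Callable, Iterable

def backfill(
    records: Iterable[dict],
    write_batch: Callable[[list[dict]], None],  # placeholder bulk-indexing call
    batch_size: int = 500,
    max_batches_per_sec: float = 5.0,
) -> None:
    """Index records in rate-limited batches to avoid overwhelming the store."""
    batch: list[dict] = []
    min_interval = 1.0 / max_batches_per_sec
    last_write = 0.0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            elapsed = time.monotonic() - last_write
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)  # throttle the write rate
            write_batch(batch)
            last_write = time.monotonic()
            batch = []
    if batch:
        write_batch(batch)  # flush the final partial batch
```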
Last but not least, our UI team has built a configurable, extensible library to simplify integrating this platform with end-user applications. A configurable UI makes it easy to customize query generation and response handling per the needs of individual applications and algorithms. Future work involves building native widgets to minimize the UI work even further.
The media understanding platform serves as an abstraction layer between machine learning algos and various applications and features. The platform has already allowed us to seamlessly integrate search and discovery capabilities into several applications. We believe future work in maturing different parts will unlock value for more use cases and applications. We hope this post has offered insights into how we approached its evolution. We will continue to share our work in this space, so stay tuned.
Do these types of challenges interest you? If so, we are always looking for engineers and machine learning practitioners to join us.
Special thanks to Vinod Uddaraju, Fernando Amat Gil, Ben Klein, Meenakshi Jindal, Varun Sekhri, Burak Bacioglu, Boris Chen, Jason Ge, Tiffany Low, Vitali Kauhanka, Supriya Vadlamani, Abhishek Soni, Gustavo Carmo, Elliot Chow, Prasanna Padmanabhan, Akshay Modi, Nagendra Kamath, Wenbing Bai, Jackson de Campos, Juan Vimberg, Patrick Strawderman, Dawn Chenette, Yuchen Xie, Andy Yao, and Chen Zheng for designing, developing, and contributing to different parts of the platform.