By Gustavo Carmo, Elliot Chow, Nagendra Kamath, Akshay Modi, Jason Ge, Wenbing Bai, Jackson de Campos, Lingyi Liu, Pablo Delgado, Meenakshi Jindal, Boris Chen, Vi Iyengar, Kelli Griggs, Amir Ziai, Prasanna Padmanabhan, and Hossein Taghavi
In 2007, Netflix began offering streaming alongside its DVD shipping services. As the catalog grew and users adopted streaming, so did the opportunities for creating and improving our recommendations. With a catalog spanning thousands of shows and a diverse member base spanning hundreds of millions of accounts, recommending the best show to our members is crucial.
Why should members care about any particular show that we recommend? Trailers and artwork provide a glimpse of what to anticipate in that show. We have been leveraging machine learning (ML) models to personalize artwork and to help our creatives create promotional content efficiently.
Our goal in building a media-focused ML infrastructure is to reduce the time from ideation to productization for our media ML practitioners. We accomplish this by paving the path to:
- Accessing and processing media data (e.g. video, image, audio, and text)
- Training large-scale models efficiently
- Productizing models in a self-serve fashion in order to execute on existing and newly arriving assets
- Storing and serving model outputs for consumption in promotional content creation
In this post, we describe some of the challenges of applying machine learning to media assets, and the infrastructure components that we have built to address them. We then present a case study of using these components in order to optimize, scale, and solidify an existing pipeline. Finally, we conclude with a brief discussion of the opportunities on the horizon.
In this section, we highlight some of the unique challenges faced by media ML practitioners, along with the infrastructure components that we have devised to address them.
Media Access: Jasper
In the early days of media ML efforts, it was very hard for researchers to access media data. Even after gaining access, one needed to deal with the challenges of heterogeneity across different assets in terms of decoding performance, size, metadata, and general formatting.
To streamline this process, we standardized media assets with pre-processing steps that create and store dedicated quality-controlled derivatives with associated snapshotted metadata. In addition, we provide a unified library that enables ML practitioners to seamlessly access video, audio, image, and various text-based assets.
Media feature computation tends to be expensive and time-consuming. Many ML practitioners independently computed identical features against the same asset in their ML pipelines.
To reduce costs and promote reuse, we have built a feature store in order to memoize features/embeddings tied to media entities. This feature store is equipped with a data replication system that enables copying data to different storage solutions depending on the required access patterns.
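To illustrate the memoization idea, here is a minimal sketch (with illustrative names, not the actual feature store API): feature values are keyed by entity, feature name, and version, so repeated requests skip recomputation.

```python
class FeatureStore:
    """Toy feature store that memoizes values per (entity, feature, version)."""

    def __init__(self):
        self._cache = {}
        self.compute_count = 0  # tracks how often we actually compute

    def get_or_compute(self, entity_id, feature_name, version, compute_fn):
        key = (entity_id, feature_name, version)
        if key not in self._cache:
            self.compute_count += 1
            self._cache[key] = compute_fn()
        return self._cache[key]


store = FeatureStore()
# two pipelines requesting the same embedding only trigger one computation
emb1 = store.get_or_compute("title_1", "shot_embedding", "v1", lambda: [0.1, 0.2])
emb2 = store.get_or_compute("title_1", "shot_embedding", "v1", lambda: [0.1, 0.2])
```

Versioning the key means a new model release ("v2") gets its own entry rather than silently overwriting older feature values.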
Productized models must run over newly arriving assets for scoring. In order to satisfy this requirement, ML practitioners had to develop bespoke triggering and orchestration components per pipeline. Over time, these bespoke components became the source of many downstream errors and were difficult to maintain.
Amber is a collection of multiple infrastructure components that provides triggering capabilities to initiate the computation of algorithms with recursive dependency resolution.
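A small sketch of what recursive dependency resolution looks like (the feature names and structure here are illustrative, not Amber's API): triggering a feature first recursively resolves and computes everything it depends on.

```python
def resolve(feature, deps, computed, order):
    """Recursively compute a feature's dependencies before the feature itself."""
    if feature in computed:
        return
    for dep in deps.get(feature, []):
        resolve(dep, deps, computed, order)
    computed.add(feature)
    order.append(feature)


# hypothetical dependency graph for a match-cutting-style pipeline
deps = {
    "match_cutting_scores": ["shot_embeddings"],
    "shot_embeddings": ["deduped_shots"],
    "deduped_shots": ["shot_boundaries"],
    "shot_boundaries": [],
}

order = []
resolve("match_cutting_scores", deps, set(), order)
# dependencies are computed bottom-up, ending with the requested feature
```

The `computed` set also captures the memoization aspect: a feature shared by several downstream algorithms is resolved only once per asset.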
Media model training poses multiple system challenges in storage, network, and GPUs. We have developed a large-scale GPU training cluster based on Ray, which supports multi-GPU / multi-node distributed training. We precompute the datasets, offload the preprocessing to CPU instances, optimize model operators within the framework, and utilize a high-performance file system to resolve the data loading bottleneck, increasing the entire training system throughput 3–5 times.
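As a toy illustration of the offloading idea (this is not the actual training stack), preprocessing can run on a pool of CPU workers so the training loop consumes ready-made samples instead of blocking on decoding:

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(sample_id):
    # stand-in for CPU-side decoding/augmentation work
    return sample_id * 2


# workers prepare samples concurrently; the consumer drains them in order
with ThreadPoolExecutor(max_workers=4) as pool:
    prepared = list(pool.map(preprocess, range(8)))
```

In the real system this role is played by dedicated CPU instances feeding the GPU cluster, with a high-performance file system keeping I/O off the critical path.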
Media feature values can be optionally synchronized to other systems depending on the required query patterns. One of these systems is Marken, a scalable service used to persist feature values as annotations, which are versioned and strongly typed constructs associated with Netflix media entities such as videos and artwork.
This service provides a user-friendly query DSL for applications to perform search operations over these annotations with specific filtering and grouping. Marken provides unique search capabilities on temporal and spatial data by time frames or region coordinates, as well as vector searches that are able to scale up to the entire catalog.
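As a rough sketch of what a temporal query does (hypothetical data model, not Marken's actual DSL), searching by time frame keeps annotations of a given type that overlap a window:

```python
# hypothetical annotations: typed, with millisecond time ranges
annotations = [
    {"type": "shot", "start_ms": 0, "end_ms": 2000},
    {"type": "shot", "start_ms": 2000, "end_ms": 5000},
    {"type": "face", "start_ms": 4000, "end_ms": 6000},
]


def search(annotations, type_, window_start, window_end):
    """Return annotations of the given type overlapping [window_start, window_end)."""
    return [
        a for a in annotations
        if a["type"] == type_
        and a["start_ms"] < window_end
        and a["end_ms"] > window_start
    ]


hits = search(annotations, "shot", 1500, 4500)
```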
ML practitioners interact with this infrastructure mostly using Python, but there is a plethora of tools and platforms being used in the systems behind the scenes. These include, but are not limited to, Conductor, Dagobah, Metaflow, Titus, Iceberg, Trino, Cassandra, Elastic Search, Spark, Ray, MezzFS, S3, Baggins, FSx, and Java/Scala-based applications with Spring Boot.
The Media Machine Learning Infrastructure is empowering various scenarios across Netflix, some of which are described here. In this section, we showcase the use of this infrastructure through the case study of Match Cutting.
Background
Match Cutting is a video editing technique. It is a transition between two shots that uses similar visual framing, composition, or motion to fluidly bring the viewer from one scene to the next. It is a powerful visual storytelling tool used to create a connection between two scenes.
In an earlier post, we described how we use machine learning to find candidate pairs. In this post, we focus on the engineering and infrastructure challenges of delivering this feature.
Where we began
Initially, we built Match Cutting to find matches across a single title (i.e. either a movie or an episode within a show). An average title has 2k shots, which means that we need to enumerate and process ~2M pairs.
This entire process was encapsulated in a single Metaflow flow. Each step was mapped to a Metaflow step, which allowed us to control the amount of resources used per step.
We download a video file and produce shot boundary metadata. An example of this data is provided below:
SB = {0: [0, 20], 1: [20, 30], 2: [30, 85], …}
Each key in the SB dictionary is a shot index and each value represents the frame range corresponding to that shot index. For example, for the shot with index 1 (the second shot), the value captures the shot frame range [20, 30], where 20 is the start frame and 29 is the end frame (i.e. the end of the range is exclusive while the start is inclusive).
Using this data, we then materialize individual clip files (e.g. clip0.mp4, clip1.mp4, etc.) corresponding to each shot so that they can be processed in Step 2.
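The interval convention above can be made concrete with a small sketch (illustrative values): the start of each range is inclusive and the end is exclusive.

```python
# shot index -> [start_frame, end_frame), end exclusive
SB = {0: [0, 20], 1: [20, 30], 2: [30, 85]}


def shot_frames(sb, shot_index):
    """Return the list of frame numbers belonging to a shot."""
    start, end = sb[shot_index]
    return list(range(start, end))  # end is exclusive


frames = shot_frames(SB, 1)  # frames 20 through 29 for the second shot
```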
This step works with the individual clip files produced in Step 1 and the list of shot boundaries. We first extract a representation (aka embedding) of each file using a video encoder (i.e. an algorithm that converts a video to a fixed-size vector) and use that embedding to identify and remove duplicate shots.
In the following example, SB_deduped is the result of deduplicating SB:

# the second shot (index 1) was removed and so was clip1.mp4
SB_deduped = {0: [0, 20], 2: [30, 85], …}

SB_deduped along with the surviving clip files are passed along to Step 3.
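A simplified sketch of this deduplication step (toy two-dimensional embeddings and an illustrative threshold, not the production encoder): a shot is dropped when its embedding is nearly identical, by cosine similarity, to one that already survived.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def dedupe(sb, embeddings, threshold=0.95):
    """Keep a shot only if it is not too similar to an earlier survivor."""
    survivors = {}
    for idx, rng in sb.items():
        if all(cosine(embeddings[idx], embeddings[kept]) < threshold
               for kept in survivors):
            survivors[idx] = rng
    return survivors


SB = {0: [0, 20], 1: [20, 30], 2: [30, 85]}
# shot 1's embedding is nearly identical to shot 0's, so it gets removed
embeddings = {0: [1.0, 0.0], 1: [0.99, 0.01], 2: [0.0, 1.0]}
SB_deduped = dedupe(SB, embeddings)
```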
We compute another representation per shot, depending on the flavor of match cutting.
We enumerate all pairs and compute a score for each pair of representations. These scores are stored along with the shot metadata:
[
# shots with indices 12 and 729 have a high matching score
{shot1: 12, shot2: 729, score: 0.96},
# shots with indices 58 and 410 have a low matching score
{shot1: 58, shot2: 410, score: 0.02},
…
]
Finally, we sort the results by score in descending order and surface the top-K pairs, where K is a parameter.
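This ranking step amounts to a sort. A minimal sketch with made-up scores:

```python
# hypothetical pair scores, as produced in Step 4
pairs = [
    {"shot1": 12, "shot2": 729, "score": 0.96},
    {"shot1": 58, "shot2": 410, "score": 0.02},
    {"shot1": 4, "shot2": 88, "score": 0.71},
]


def top_k(pairs, k):
    """Return the k highest-scoring pairs, best first."""
    return sorted(pairs, key=lambda p: p["score"], reverse=True)[:k]


best = top_k(pairs, 2)  # the two strongest candidate match cuts
```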
The issues we faced
This pattern works well for a single flavor of match cutting and for finding matches within the same title. As we started venturing beyond single-title matching and adding more flavors, we quickly faced a few problems.
The representations we extract in Steps 2 and 3 are sensitive to the characteristics of the input video files. In some cases, such as instance segmentation, the output representation in Step 3 is a function of the dimensions of the input file.
Not having a standardized input file format (e.g. same encoding recipes and dimensions) created matching quality issues when representations across titles with different input files needed to be processed together (e.g. multi-title match cutting).
Segmentation at the shot level is a common task used across many media ML pipelines. Also, deduplicating similar shots is a common step that a subset of these pipelines shares.
We realized that memoizing these computations not only reduces waste but also allows for congruence between algo pipelines that share the same preprocessing step. In other words, having a single source of truth for shot boundaries helps us guarantee additional properties for the data generated downstream. As a concrete example, knowing that algo A and algo B both used the same shot boundary detection step, we know that shot index i has identical frame ranges in both. Without this information, we would have to check if this is actually true.
Our stakeholders (i.e. video editors using match cutting) want to start working on titles as quickly as the video files land. Therefore, we built a mechanism to trigger the computation upon the landing of new video files. This triggering logic turned out to present two issues:
- Lack of standardization meant that the computation was sometimes re-triggered for the same video file due to changes in metadata, without any content change.
- Many pipelines independently developed similar bespoke components for triggering computation, which created inconsistencies.
Moreover, decomposing the pipeline into modular pieces and orchestrating computation with dependency semantics did not map to existing workflow orchestrators such as Conductor and Meson out of the box. The media machine learning domain needed to be mapped with some level of coupling between media asset metadata, media access, feature storage, feature compute, and feature compute triggering, in a way that new algorithms could be easily plugged in with predefined standards.
This is where Amber comes in, offering a Media Machine Learning Feature Development and Productization Suite, gluing all aspects of shipping algorithms while permitting the interdependency and composability of multiple smaller parts required to devise a complex system.
Each part is in itself an algorithm, which we call an Amber Feature, with its own scope of computation, storage, and triggering. Using dependency semantics, an Amber Feature can be plugged into other Amber Features, allowing for the composition of a complex mesh of interrelated algorithms.
Step 4 entails a computation that is quadratic in the number of shots. For instance, matching across a series with 10 episodes and an average of 2K shots per episode translates into 200M comparisons. Matching across 1,000 files (across multiple shows) would take roughly 2 trillion computations.
Setting aside the sheer number of computations required for a moment, editors may be interested in considering any subset of shows for matching. The naive approach would be to pre-compute all possible subsets of shows. Even assuming that we only have 1,000 video files, this means we would have to pre-compute 2¹⁰⁰⁰ subsets, which is more than the number of atoms in the observable universe!
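Both blow-ups are easy to verify with a quick back-of-the-envelope check:

```python
from math import comb

# pairwise comparisons grow quadratically with the number of shots
assert comb(2_000, 2) == 1_999_000         # ~2M pairs in one title
assert comb(10 * 2_000, 2) == 199_990_000  # ~200M across a 10-episode series
catalog_pairs = comb(1_000 * 2_000, 2)     # pairs across 1,000 files

# pre-computing every subset of 1,000 files is hopeless: 2**1000 has
# 302 digits, far more than the ~10**80 atoms in the observable universe
num_subsets = 2 ** 1000
assert num_subsets > 10 ** 80
```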
Ideally, we want to use an approach that avoids both issues.
The Media Machine Learning Infrastructure provided many of the building blocks required for overcoming these hurdles.
The entire Netflix catalog is pre-processed and stored for reuse in machine learning scenarios. Match Cutting benefits from this standardization, as it relies on homogeneity across videos for correct matching.
Videos are matched at the shot level. Since breaking videos into shots is a very common task across many algorithms, the infrastructure team provides this canonical feature, which can be used as a dependency for other algorithms. With this, we were able to reuse memoized feature values, saving on compute costs and guaranteeing coherence of shot segments across algos.
We used Amber's feature dependency semantics to tie the computation of embeddings to shot deduplication. Leveraging Amber's triggering, we automatically initiate scoring for new videos as soon as the standardized video encodes are ready. Amber handles the computation in the dependency chain recursively.
We store embeddings in Amber, which guarantees immutability, versioning, auditing, and various metrics on top of the feature values. This also allows other algorithms to be built on top of the Match Cutting output as well as all the intermediate embeddings.
We also used Amber's synchronization mechanisms to replicate data from the main feature value copies to Marken, which is used for serving.
High-scoring pairs are served to video editors in internal applications via Marken.
The following figure depicts the new pipeline using the above-mentioned components: