By Gustavo Carmo, Elliot Chow, Nagendra Kamath, Akshay Modi, Jason Ge, Wenbing Bai, Jackson de Campos, Lingyi Liu, Pablo Delgado, Meenakshi Jindal, Boris Chen, Vi Iyengar, Kelli Griggs, Amir Ziai, Prasanna Padmanabhan, and Hossein Taghavi
In 2007, Netflix began offering streaming alongside its DVD shipping services. As the catalog grew and users adopted streaming, so did the opportunities for creating and improving our recommendations. With a catalog spanning thousands of shows and a diverse member base spanning millions of accounts, recommending the right show to our members is crucial.
Why should members care about any particular show that we recommend? Trailers and artwork provide a glimpse of what to expect in that show. We have been leveraging machine learning (ML) models to personalize artwork and to help our creatives create promotional content efficiently.
Our goal in building a media-focused ML infrastructure is to reduce the time from ideation to productization for our media ML practitioners. We accomplish this by paving the path to:
- accessing and processing media data (e.g. video, image, audio, and text)
- training large-scale models efficiently
- productizing models in a self-serve fashion in order to execute on existing and newly arriving assets
- storing and serving model outputs for consumption in promotional content creation
In this post, we describe some of the challenges of applying machine learning to media assets, and the infrastructure components that we have built to address them. We then present a case study of using these components in order to optimize, scale, and solidify an existing pipeline. Finally, we conclude with a brief discussion of the opportunities on the horizon.
In this section, we highlight some of the unique challenges faced by media ML practitioners, along with the infrastructure components that we have devised to address them.
Media Access: Jasper
In the early days of media ML efforts, it was very hard for researchers to access media data. Even after gaining access, one needed to deal with the lack of homogeneity across different assets in terms of decoding performance, size, metadata, and general formatting.
To streamline this process, we standardized media assets with pre-processing steps that create and store dedicated quality-controlled derivatives with associated snapshotted metadata. In addition, we provide a unified library that enables ML practitioners to seamlessly access video, audio, image, and various text-based assets.
Media Feature Storage: Amber Storage
Media feature computation tends to be expensive and time-consuming. Many ML practitioners independently computed similar features against the same asset in their ML pipelines.
To reduce costs and promote reuse, we built a feature store in order to memoize features/embeddings tied to media entities. This feature store is equipped with a data replication system that enables copying data to different storage solutions depending on the required access patterns.
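As a minimal sketch of the memoization idea, the snippet below keys feature values by media entity and a versioned feature name; the names and the in-memory store are illustrative stand-ins, not the actual feature store API:

```python
# A minimal sketch of feature memoization, with an in-memory dict standing in
# for the real, replicated feature store; all names here are illustrative.
from typing import Callable

_store: dict[tuple[str, str, str], list[float]] = {}

def get_or_compute(entity_id: str, feature: str, version: str,
                   compute: Callable[[str], list[float]]) -> list[float]:
    """Return the memoized value for (entity, feature, version),
    computing and storing it on the first request."""
    key = (entity_id, feature, version)
    if key not in _store:
        _store[key] = compute(entity_id)  # the expensive media computation
    return _store[key]

# Two pipelines asking for the same embedding share one computation.
emb = get_or_compute("title_123", "clip_embedding", "v2", lambda _: [0.1, 0.9])
```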
Media Compute Triggering and Orchestration: Amber Orchestration
Productized models must run over newly arriving assets for scoring. In order to satisfy this requirement, ML practitioners had to develop bespoke triggering and orchestration components per pipeline. Over time, these bespoke components became the source of many downstream errors and were difficult to maintain.
Amber is a suite of multiple infrastructure components that provides triggering capabilities to initiate the computation of algorithms with recursive dependency resolution.
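The following toy sketch illustrates recursive dependency resolution at trigger time; the graph and names are hypothetical, not Amber's actual interface:

```python
# A toy illustration of trigger-time recursive dependency resolution:
# when a feature is requested, its upstream features are computed first.
# The graph and names are hypothetical; cycle detection is omitted.
from typing import Callable

def resolve(feature: str, deps: dict[str, list[str]],
            done: set[str], run: Callable[[str], None]) -> None:
    for upstream in deps.get(feature, []):
        resolve(upstream, deps, done, run)
    if feature not in done:
        run(feature)          # all dependencies are satisfied at this point
        done.add(feature)

# Pair scores depend on embeddings, which depend on shot boundaries.
deps = {"pair_scores": ["embeddings"], "embeddings": ["shot_boundaries"]}
resolve("pair_scores", deps, set(), lambda f: print("computing", f))
# computing shot_boundaries -> computing embeddings -> computing pair_scores
```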
Training Performance
Media model training poses multiple system challenges in storage, network, and GPUs. We have developed a large-scale GPU training cluster based on Ray, which supports multi-GPU / multi-node distributed training. We precompute the datasets, offload the preprocessing to CPU instances, optimize model operators within the training framework, and utilize a high-performance file system to resolve the data loading bottleneck, increasing the entire training system throughput 3–5 times.
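As a minimal sketch of what multi-GPU / multi-node training looks like with Ray Train's TorchTrainer (Ray 2.x), with the loop body and the cluster-specific optimizations described above elided:

```python
# A minimal sketch of distributed training with Ray Train (Ray 2.x).
# The training loop body is a placeholder; real pipelines would read
# precomputed datasets from a high-performance file system and keep
# preprocessing on CPU instances, as described above.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config: dict) -> None:
    # Runs on every worker; Ray sets up torch.distributed underneath.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 1},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```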
Serving and Searching
Media feature values can be optionally synchronized to other systems depending on the necessary query patterns. One of these systems is Marken, a scalable service used to persist feature values as annotations, which are versioned and strongly typed constructs associated with Netflix media entities such as videos and artwork.
This service provides a user-friendly query DSL for applications to perform search operations over these annotations with specific filtering and grouping. Marken provides unique search capabilities on temporal and spatial data by time frames or region coordinates, as well as vector searches that are able to scale up to the entire catalog.
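For illustration, the following plain-Python stand-in captures the kind of temporal filtering such queries express; the Annotation shape and field names are assumptions, not Marken's actual schema or DSL:

```python
# Illustrative only: a plain-Python stand-in for the kind of temporal
# filtering Marken's query DSL supports. The Annotation shape and field
# names are assumptions, not the real schema.
from dataclasses import dataclass

@dataclass
class Annotation:
    entity_id: str     # the Netflix media entity (e.g. a video or artwork)
    feature: str       # e.g. "match_cut_pair"
    version: int       # annotations are versioned
    start_frame: int   # temporal extent of the annotation
    end_frame: int

def overlapping(annotations: list[Annotation], lo: int, hi: int) -> list[Annotation]:
    """Return annotations overlapping the half-open frame window [lo, hi)."""
    return [a for a in annotations if a.start_frame < hi and a.end_frame > lo]
```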
ML practitioners interact with this infrastructure mostly using Python, but there is a plethora of tools and platforms being used in the systems behind the scenes. These include, but are not limited to, Conductor, Dagobah, Metaflow, Titus, Iceberg, Trino, Cassandra, Elastic Search, Spark, Ray, MezzFS, S3, Baggins, FSx, and Java/Scala-based applications with Spring Boot.
Case Study: Match Cutting
The Media Machine Learning Infrastructure is empowering various scenarios across Netflix, some of which are described here. In this section, we showcase the use of this infrastructure through the case study of Match Cutting.
Background
Match Cutting is a video editing technique. It is a transition between two shots that uses similar visual framing, composition, or motion to fluidly bring the viewer from one scene to the next. It is a powerful visual storytelling tool used to create a connection between two scenes.
In an earlier post, we described how we use machine learning to find candidate pairs. In this post, we focus on the engineering and infrastructure challenges of delivering this feature.
Where we began
Initially, we built Match Cutting to find matches across a single title (i.e. either a movie or an episode within a show). An average title has 2K shots, which means that we need to enumerate and process ~2M pairs.
This entire process was encapsulated in a single Metaflow flow. Each step was mapped to a Metaflow step, which allowed us to control the amount of resources used per step.
Step 1
We download a video file and produce shot boundary metadata. An example of this data is provided below:
SB = {0: [0, 20], 1: [20, 30], 2: [30, 85], …}
Each key in the SB dictionary is a shot index and each value represents the frame range corresponding to that shot index. For example, for the shot with index 1 (the second shot), the value captures the shot frame range [20, 30], where 20 is the start frame and 29 is the end frame (i.e. the end of the range is exclusive while the start is inclusive).
Using this data, we then materialized individual clip files (e.g. clip0.mp4, clip1.mp4, etc.) corresponding to each shot so that they can be processed in Step 2.
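A minimal sketch of this materialization step, assuming the ffmpeg CLI is available on PATH and a hypothetical 24 fps source file named title.mp4:

```python
# A minimal sketch of materializing per-shot clips from the SB dictionary
# using the ffmpeg CLI. The source file name and 24 fps frame rate are
# assumptions; frame ranges are half-open, i.e. [start, end).
import subprocess

FPS = 24
SB = {0: [0, 20], 1: [20, 30], 2: [30, 85]}

for idx, (start, end) in SB.items():
    start_s = start / FPS
    duration_s = (end - start) / FPS
    # Re-encoding (no "-c copy") keeps the cut frame-accurate rather than
    # snapping to the nearest keyframe.
    subprocess.run(
        ["ffmpeg", "-ss", f"{start_s:.3f}", "-i", "title.mp4",
         "-t", f"{duration_s:.3f}", f"clip{idx}.mp4"],
        check=True,
    )
```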
Step 2
This step works with the individual clip files produced in Step 1 and the list of shot boundaries. We first extract a representation (aka embedding) of each file using a video encoder (i.e. an algorithm that converts a video to a fixed-size vector) and use that embedding to identify and remove duplicate shots.
In the following example, SB_deduped is the result of deduplicating SB:
# the second shot (index 1) was removed and so was clip1.mp4
SB_deduped = {0: [0, 20], 2: [30, 85], …}
SB_deduped along with the surviving clip files are passed along to Step 3.
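A minimal sketch of this deduplication, assuming per-shot embeddings are already computed; the greedy keep-first strategy and the 0.95 cosine-similarity threshold are illustrative assumptions:

```python
# A minimal sketch of shot deduplication via embedding similarity. The
# greedy keep-first strategy and the 0.95 threshold are assumptions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedupe(sb: dict[int, list[int]],
           embeddings: dict[int, np.ndarray],
           threshold: float = 0.95) -> dict[int, list[int]]:
    kept: dict[int, list[int]] = {}
    for idx in sorted(sb):
        # Keep a shot only if it is not near-identical to a kept shot.
        if all(cosine(embeddings[idx], embeddings[k]) < threshold for k in kept):
            kept[idx] = sb[idx]
    return kept
```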
Step 3
We compute another representation per shot, depending on the flavor of match cutting.
Step 4
We enumerate all pairs and compute a score for each pair of representations. These scores are stored along with the shot metadata:
[
# shots with indices 12 and 729 have a high matching score
{shot1: 12, shot2: 729, score: 0.96},
# shots with indices 58 and 410 have a low matching score
{shot1: 58, shot2: 410, score: 0.02},
…
]
Step 5
Finally, we sort the results by score in descending order and surface the top-K pairs, where K is a parameter.
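Steps 4 and 5 boil down to the following sketch, with cosine similarity standing in for the actual pair-scoring model:

```python
# A minimal sketch of Steps 4 and 5: score every shot pair and keep the
# top-K. Cosine similarity stands in for the real pair-scoring model.
import heapq
import itertools
import numpy as np

def score(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_pairs(embeddings: dict[int, np.ndarray], k: int) -> list[dict]:
    # Enumerating all unordered pairs is quadratic in the number of shots.
    scored = (
        {"shot1": i, "shot2": j, "score": score(embeddings[i], embeddings[j])}
        for i, j in itertools.combinations(sorted(embeddings), 2)
    )
    return heapq.nlargest(k, scored, key=lambda p: p["score"])
```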
The issues we faced
This pattern works well for a single flavor of match cutting and for finding matches within the same title. As we started venturing beyond single-title matching and adding more flavors, we quickly faced a few problems.
The representations we extract in Steps 2 and 3 are sensitive to the characteristics of the input video files. In some cases, such as instance segmentation, the output representation in Step 3 is a function of the dimensions of the input file.
Not having a standardized input file format (e.g. the same encoding recipes and dimensions) created matching quality issues when representations across titles with different input files needed to be processed together (e.g. multi-title match cutting).
Segmentation at the shot level is a common task used across many media ML pipelines. Also, deduplicating similar shots is a common step that a subset of these pipelines shares.
We realized that memoizing these computations not only reduces waste but also allows for congruence between algo pipelines that share the same preprocessing step. In other words, having a single source of truth for shot boundaries helps us guarantee additional properties for the data generated downstream. As a concrete example, knowing that algo A and algo B both used the same shot boundary detection step, we know that shot index i has identical frame ranges in both. Without this knowledge, we would have to check whether this is actually true.
Our stakeholders (i.e. video editors using match cutting) want to start working on titles as quickly as the video files land. Therefore, we built a mechanism to trigger the computation upon the landing of new video files. This triggering logic turned out to present two issues:
- Lack of standardization meant that the computation was sometimes re-triggered for the same video file due to changes in metadata, without any content change.
- Many pipelines independently developed similar bespoke components for triggering computation, which created inconsistencies.
Moreover, decomposing the pipeline into modular pieces and orchestrating computation with dependency semantics did not map to existing workflow orchestrators such as Conductor and Meson out of the box. The media machine learning domain needed to be mapped with some level of coupling between media asset metadata, media access, feature storage, feature compute, and feature compute triggering, in a way that new algorithms could be easily plugged in with predefined standards.
This is where Amber comes in, offering a Media Machine Learning Feature Development and Productization Suite that glues together all aspects of shipping algorithms while permitting the interdependency and composability of the multiple smaller parts required to devise a complex system.
Each part is in itself an algorithm, which we call an Amber Feature, with its own scope of computation, storage, and triggering. Using dependency semantics, an Amber Feature can be plugged into other Amber Features, allowing for the composition of a complex mesh of interrelated algorithms.
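As a hypothetical illustration of this composition (the Feature class below is not Amber's real interface), features can declare their dependencies explicitly, and triggering one walks the chain recursively as described earlier:

```python
# Hypothetical illustration of dependency-declared features; the Feature
# class is not Amber's real interface. Triggering "pair_scores" would
# recursively resolve "embeddings" and then "shot_boundaries" first.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Feature:
    name: str
    compute: Callable[..., Any]
    deps: list["Feature"] = field(default_factory=list)

shots = Feature("shot_boundaries", compute=lambda video: {})
embeddings = Feature("embeddings", compute=lambda sb: [], deps=[shots])
scores = Feature("pair_scores", compute=lambda emb: [], deps=[embeddings])
```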
Step 4 entails a computation that is quadratic in the number of shots. For instance, matching across a series with 10 episodes with an average of 2K shots per episode translates into 200M comparisons. Matching across 1,000 files (across multiple shows) would take roughly 2 trillion computations.
Setting aside the sheer number of computations required for a moment, editors may be interested in considering any subset of shows for matching. The naive approach would be to pre-compute all possible subsets of shows. Even assuming that we only have 1,000 video files, this means we would have to pre-compute 2¹⁰⁰⁰ subsets, which is more than the number of atoms in the observable universe!
Ideally, we want to use an approach that avoids both issues.
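A quick sanity check of the combinatorics above:

```python
# Sanity-checking the pair counts quoted above with exact combinatorics.
import math

# A 10-episode series at ~2K shots per episode:
print(math.comb(10 * 2_000, 2))      # 199_990_000  (~200M comparisons)

# 1,000 files at ~2K shots each:
print(math.comb(1_000 * 2_000, 2))   # 1_999_999_000_000  (~2 trillion)

# Pre-computing every subset of 1,000 files is hopeless:
print(2 ** 1000)                     # ~1.07e301, versus ~1e80 atoms
```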
Where we landed
The Media Machine Learning Infrastructure provided many of the building blocks required to overcome these hurdles.
The entire Netflix catalog is pre-processed and stored for reuse in machine learning scenarios. Match Cutting benefits from this standardization, as it relies on homogeneity across videos for correct matching.
Videos are matched at the shot level. Since breaking videos into shots is a very common task across many algorithms, the infrastructure team provides this canonical feature that can be used as a dependency by other algorithms. With this, we were able to reuse memoized feature values, saving on compute costs and guaranteeing coherence of shot segments across algos.
We used Amber's feature dependency semantics to tie the computation of embeddings to shot deduplication. Leveraging Amber's triggering, we automatically initiate scoring for new videos as soon as the standardized video encodes are ready. Amber handles the computation in the dependency chain recursively.
We store embeddings in Amber, which guarantees immutability, versioning, auditing, and various metrics on top of the feature values. This also allows other algorithms to be built on top of the Match Cutting output as well as all the intermediate embeddings.
We also used Amber's synchronization mechanisms to replicate data from the primary feature value copies to Marken, which is used for serving.
High-scoring pairs are served to video editors in internal applications via Marken.
The following figure depicts the new pipeline built with the above-mentioned components.