Preparing Video Data for Deep Learning: Introducing Vid Prepper


This article is an introduction to preparing videos for machine learning and deep learning. Because of the scale and computational cost of video data, it is vital that it is processed in the most efficient way possible for your use case. This includes things like metadata analysis, standardization, augmentation, shot and object detection, and tensor loading. This article explores some of the ways these can be done and why we might do them. I have also built an open source Python package called vid-prepper, with the aim of providing a quick and efficient way to apply different preprocessing techniques to your video data. The package builds on some giants of the machine learning and deep learning world, so while it is helpful in bringing them together in a common, easy-to-use framework, the real work is most definitely theirs!

Video has been a key part of my career. I started my data career at a company that built a SaaS platform for video analytics for major video companies (called NPAW), and I currently work for the BBC. Video dominates the online landscape, but its use with AI is still fairly limited, although growing fast. I wanted to create something that helps speed up people's ability to try things out and contribute to this really interesting area. This article will discuss what the different package modules do and how to use them, starting with metadata analysis.

Metadata Analysis

from vid_prepper import metadata

At the BBC, I am pretty fortunate to work at a professional organisation with hugely talented people creating broadcast-quality videos. However, I know that most video data is not like this. Often files will be in mixed formats, colours or sizes, or they may be corrupted or have parts missing, or they may have quirks from older videos, like interlacing. It is important to be aware of any of this before processing videos for machine learning.

We will be training our models on GPUs, and these are fantastic for tensor calculations at scale but expensive to run. When training large models on GPUs, we want to be as efficient as possible to avoid high costs. If we have corrupted videos, or videos in unexpected or unsupported formats, it will waste time and resources, can make your models less accurate, and can even break the training pipeline. Therefore, checking and filtering your files beforehand is a necessity.

Metadata analysis is almost always a crucial first step in preparing video data (image source – Pexels)

I have built the metadata analysis module on ffprobe, part of the FFmpeg project written in C and assembly. It is a hugely powerful and efficient library used extensively in the industry, and the module can be used to analyse a single video file or a batch of them, as shown in the code below.

# Extract metadata for a single video
video_path = ["sample.mp4"]
video_info = metadata.Metadata.validate_videos(video_path)

# Extract metadata for a batch of videos
video_paths = ["sample1.mp4", "sample2.mp4", "sample3.mp4"]
video_info = metadata.Metadata.validate_videos(video_paths)

This provides a dictionary output of the video metadata, including codecs, sizes, frame rates, duration, pixel formats, audio metadata and more. This is really useful both for finding video data with issues or odd quirks, and for selecting specific video data or choosing the formats and codec to standardize to, based on the most commonly used ones.
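For reference, the kind of ffprobe call that sits underneath looks roughly like the sketch below. This is plain FFmpeg tooling rather than the vid-prepper API, shown only to illustrate where the metadata comes from.

# Direct ffprobe call returning metadata as JSON (not the vid-prepper API)
import json
import subprocess

result = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_format", "-show_streams", "sample.mp4"],
    capture_output=True, text=True, check=True,
)
info = json.loads(result.stdout)
# Duration lives under "format", per-stream codecs under "streams"
print(info["format"]["duration"], [s["codec_name"] for s in info["streams"]])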

Filtering Based on Metadata Issues

Given this seemed to be a fairly common use case, I built in the ability to filter the list of videos based on a set of checks. For example, if video or audio is missing, codecs or formats are not as specified, or frame rates or durations differ from those specified, then these videos can be identified by setting the filters and only_errors parameters, as shown below.

# Run tests on videos
videos = ["video1.mp4", "video2.mkv", "video3.mov"]

all_filters_with_params = {
    "filter_missing_video": {},
    "filter_missing_audio": {},
    "filter_variable_framerate": {},
    "filter_resolution": {"min_width": 1280, "min_height": 720},
    "filter_duration": {"min_seconds": 5.0},
    "filter_pixel_format": {"allowed": ["yuv420p", "yuv422p"]},
    "filter_codecs": {"allowed": ["h264", "hevc", "vp9", "prores"]}
}

errors = metadata.Metadata.validate_videos(
    videos,
    filters=all_filters_with_params,
    only_errors=True
)

Removing or identifying issues with the data before we get to the really intensive work of model training means we save time and money, making it an important first step.
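For example, assuming the returned errors are keyed by the offending file paths (an assumption about the output format rather than documented behaviour), dropping the flagged files from your training list might look like this:

# Hypothetical clean-up step; assumes `errors` is keyed by the failing file paths
clean_videos = [v for v in videos if v not in errors]
print(f"Keeping {len(clean_videos)} of {len(videos)} videos")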

Standardization

from vid_prepper import standardize

Standardization is usually pretty important in preprocessing for video machine learning. It can make things much more efficient and consistent, and deep learning models often require specific sizes (e.g. 224 x 224). If you have a lot of video data, then any time spent at this stage is usually repaid many times over in the training stage later on.

Standardizing video data can make processing much, much more efficient and give better results (image source – Pexels)

Codecs

Videos are usually structured for efficient storage and distribution over the internet so that they can be broadcast cheaply and quickly. This normally involves heavy compression to make videos as small as possible. Unfortunately, that is pretty much diametrically opposed to what is good for deep learning.

The bottleneck for deep learning is almost always decoding videos and loading them into tensors, so the more compressed a video file is, the longer that takes. This typically means avoiding ultra-compressed codecs like H.265 and VVC and going for more lightly compressed alternatives with hardware acceleration, like H.264 or VP9, or, as long as you can avoid I/O bottlenecks, using something like MJPEG, which tends to be used in production as it is among the fastest ways of loading frames into tensors.
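As a concrete, hedged example, re-encoding a heavily compressed source to H.264 with plain FFmpeg might look like the snippet below. This is not the vid-prepper API; its standardize module, covered later, wraps this kind of step for you, and the file names are placeholders.

# Re-encode an H.265 source to H.264 so it decodes faster during training
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-i", "input_h265.mp4",
    "-c:v", "libx264",        # lighter-weight codec with fast, widely accelerated decode
    "-preset", "fast",
    "-pix_fmt", "yuv420p",
    "output_h264.mp4",
], check=True)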

Frame Rate

The standard frame rates (FPS) for video are 24 for cinema, 30 for TV and online content, and 60 for fast-motion content. These frame rates are determined by the number of images that need to be shown per second so that our eyes perceive smooth motion. However, deep learning models don't necessarily need as high a frame rate in the training videos to create numeric representations of motion and generate smooth-looking videos. As every frame is an additional tensor to compute, we want to reduce the frame rate to the smallest we can get away with.

The type of video and the use case of our models will determine how low we can go. The less motion in a video, the lower we can set the input frame rate without compromising the results. For example, an input dataset of studio news clips or talk shows is going to require a lower frame rate than a dataset made up of ice hockey matches. Also, if we are working on a video understanding or video-to-text model, rather than generating video for human consumption, it may be possible to set the frame rate even lower.

Calculating Minimum Frame Rate

It is actually possible to mathematically determine a reasonably good minimum frame rate for your video dataset based on motion statistics. Using the RAFT or Farneback algorithm on a sample of your dataset, you can calculate the optical flow per pixel for each frame change. This gives the horizontal and vertical displacement for each pixel, from which you can calculate the magnitude of the change (the square root of the sum of the squared values).

Averaging this value over the frame gives the frame momentum, and taking the median and 95th percentile across all of the frames gives values that you can plug into the equations below to get a range of likely optimal minimum frame rates for your training data.

Optimal FPS (Lower) = Current FPS x Median momentum / Max model interpolation rate

Optimal FPS (Higher) = Current FPS x 95th percentile momentum / Max model interpolation rate

where the max model interpolation rate is the maximum per-frame momentum the model can handle, usually provided in the model card.
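The sketch below shows one way this could be computed with OpenCV's Farneback implementation. It is my own illustration rather than part of vid-prepper, and the max_interp value is a placeholder you would take from your model's card.

# Rough sketch of the momentum calculation using OpenCV Farneback optical flow
import cv2
import numpy as np

cap = cv2.VideoCapture("sample.mp4")
current_fps = cap.get(cv2.CAP_PROP_FPS)

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

frame_momentum = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)  # Pythagoras per pixel
    frame_momentum.append(magnitude.mean())  # average momentum for this frame change
    prev_gray = gray
cap.release()

median_m = np.median(frame_momentum)
p95_m = np.percentile(frame_momentum, 95)

max_interp = 2.0  # assumed max per-frame momentum the model can handle (see its model card)
fps_lower = current_fps * median_m / max_interp
fps_higher = current_fps * p95_m / max_interp
print(fps_lower, fps_higher)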

Working out momentum is nothing more than a bit of Pythagoras. No PhD maths here! (image source – Pexels)

You can then run small-scale tests of your training pipeline to determine the lowest frame rate you can get away with for optimal performance.

Vid Prepper

The standardize module in vid-prepper can standardize the size, codec, colour format and frame rate of a single video or a batch of videos.

Again, it is built on FFmpeg and can accelerate things on a GPU if that is available to you. To standardize videos, you can simply run the code below.

# Standardize a batch of videos
video_file_paths = ["sample1.mp4", "sample2.mp4", "sample3.mp4"]
standardizer = standardize.VideoStandardizer(
    size="224x224",
    fps=16,
    codec="h264",
    color="rgb",
    use_gpu=False,  # Set to True if you have CUDA
)

standardizer.batch_standardize(videos=video_file_paths, output_dir="videos/")

To make things more efficient, especially if you are using expensive GPUs and don't want an I/O bottleneck from loading videos, the module also accepts WebDatasets. These can be loaded similarly to the following code:

# Standardize a WebDataset
standardizer = standardize.VideoStandardizer(
    size="224x224",
    fps=16,
    codec="h264",
    color="rgb",
    use_gpu=False,  # Set to True if you have CUDA
)

standardizer.standardize_wds("dataset.tar", key="mp4", label="cls")
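If your clips are not already packed as a WebDataset, a minimal sketch of building one with the webdataset library is shown below. This is an assumption about your pipeline rather than part of vid-prepper, which only consumes the shards; the mp4 and cls keys match those used above.

# Pack video files and labels into a WebDataset shard
import webdataset as wds

samples = [("sample1.mp4", 0), ("sample2.mp4", 1), ("sample3.mp4", 0)]

with wds.TarWriter("dataset.tar") as sink:
    for i, (path, label) in enumerate(samples):
        with open(path, "rb") as f:
            sink.write({
                "__key__": f"{i:06d}",  # unique key per sample
                "mp4": f.read(),        # raw video bytes under the "mp4" key
                "cls": str(label),      # label under the "cls" key
            })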

Tensor Loader

from vid_prepper import loader

A video tensor usually has 4 or 5 dimensions, consisting of the pixel colour channels (normally RGB), the height and width of the frame, time, and an optional batch component. As mentioned above, decoding videos into tensors is usually the biggest bottleneck in the preprocessing pipeline, so the steps taken up to this point make a big difference to how efficiently we can load our tensors.
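As a quick illustration, a batch of clips might look like the dummy tensor below. The exact dimension order is an assumption here; some models expect (batch, channels, time, height, width) instead, so check what your model wants.

# Dummy 5D video tensor: (batch, frames, channels, height, width)
import torch

batch, frames, channels, height, width = 4, 16, 3, 224, 224
clip_batch = torch.zeros(batch, frames, channels, height, width)
print(clip_batch.shape)  # torch.Size([4, 16, 3, 224, 224])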

This module converts videos into PyTorch tensors, using FFmpeg for frame sampling and NVDEC to allow for GPU acceleration. You can alter the size of the tensors to suit your model, along with choosing the number of frames to sample per clip and the frame stride (the spacing between frames). As with standardization, the option to use WebDatasets is also available. The code below gives an example of how this is done.

# Load clips into tensors
video_loader = loader.VideoLoader(num_frames=16, frame_stride=2, size=(224, 224), device="cuda")
video_paths = ["video1.mp4", "video2.mp4", "video3.mp4"]
batch_tensor = video_loader.load_files(video_paths)

# Load a WebDataset into tensors
wds_path = "data/shards/{00000..00009}.tar"
dataset = video_loader.load_wds(wds_path, key="mp4", label="cls")

Detector

from vid_prepper import detector

It is often an essential part of video preprocessing to detect things within the video content. These might be particular objects, shots or transitions. This module brings together powerful processes and models from PySceneDetect, Hugging Face, IDEA Research and PyTorch to provide efficient detection.

Video detection can be a useful way of splitting videos into clips and keeping only the clips you want for your model (image source – Pexels)

Shot Detection

In many video machine learning use cases (e.g. semantic search, seq2seq trailer generation and many more), splitting videos into individual shots is an important step. There are a few ways of doing this, but PySceneDetect is one of the more accurate and reliable options. Vid-prepper provides a wrapper for PySceneDetect's content detection, called as shown below. It outputs the start and end frames for each shot.

# Detect shots in a video
video_path = "video.mp4"
video_detector = detector.VideoDetector(device="cuda")
shot_frames = video_detector.detect_shots(video_path)
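If you then want to cut the detected shots into separate files, something like the sketch below could work. It assumes shot_frames is a list of (start_frame, end_frame) pairs and that you know the source frame rate; both are assumptions for illustration rather than documented behaviour.

# Cut each detected shot into its own clip with plain FFmpeg
import subprocess

fps = 25  # assumed source frame rate; in practice, read it from the metadata module

for i, (start_f, end_f) in enumerate(shot_frames):
    start_s, end_s = start_f / fps, end_f / fps
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-ss", f"{start_s:.3f}", "-to", f"{end_s:.3f}",
        "-c", "copy", f"shot_{i:03d}.mp4",
    ], check=True)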

Transition Detection

While PySceneDetect is a powerful tool for splitting videos into individual scenes, it is not always 100% accurate. There are times when you may be able to take advantage of repeated content (e.g. transitions) breaking up shots. For example, BBC News has an upwards red and white wipe transition between segments that can easily be detected using something like PyTorch.

Transition detection works directly on tensors by detecting changes in blocks of pixels that exceed a threshold you can set. The example code below shows how it works.

# Detect gradual transitions/wipes
video_path = "video.mp4"
video_loader = loader.VideoLoader(num_frames=16,
                                  frame_stride=2,
                                  size=(224, 224),
                                  device="cpu",
                                  use_nvdec=False)  # Use "cuda" and NVDEC if available
video_tensor = video_loader.load_file(video_path)

video_detector = detector.VideoDetector(device="cpu")  # or "cuda"
wipe_frames = video_detector.detect_wipes(video_tensor,
                                          block_grid=(8, 8),
                                          threshold=0.3)
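To make the idea concrete, the sketch below shows a simple block-wise frame-difference score in plain PyTorch. It is not vid-prepper's implementation, and it assumes a clip tensor shaped (frames, channels, height, width) with pixel values roughly in [0, 1].

# Illustrative block-wise change score for wipe/transition detection
import torch

def wipe_scores(frames: torch.Tensor, block_grid=(8, 8)) -> torch.Tensor:
    # frames: (T, C, H, W); returns the fraction of blocks that change per frame step
    gh, gw = block_grid
    gray = frames.float().mean(dim=1, keepdim=True)                   # (T, 1, H, W)
    blocks = torch.nn.functional.adaptive_avg_pool2d(gray, (gh, gw))  # one value per block
    diffs = (blocks[1:] - blocks[:-1]).abs()                          # change per block
    changed = (diffs > 0.1).float()                                   # assumes values in [0, 1]
    return changed.mean(dim=(1, 2, 3))                                # fraction of blocks changed

# Frames where a large share of blocks change at once are wipe candidates
scores = wipe_scores(video_tensor.squeeze(0))
candidate_frames = (scores > 0.3).nonzero().flatten()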

Object Detection

Object detection is often a prerequisite for finding the clips you want in your video data. For example, you may require clips with people or animals in them. This method uses an open source Grounding DINO model against a small set of objects from the standard COCO dataset labels to detect objects. Both the model selection and the list of objects are completely customisable and can be set by you. The model loader is the Hugging Face transformers package, so the model you use will need to be available there. For custom labels, the default model takes a string with the following structure in the text_queries parameter – "dog. cat. ambulance."

# Detect objects in a video
video_path = "video.mp4"
video_loader = loader.VideoLoader(num_frames=16,
                                  frame_stride=2,
                                  size=(224, 224),
                                  device="cpu",
                                  use_nvdec=False)  # Use "cuda" and NVDEC if available
video_tensor = video_loader.load_file(video_path)

text_queries = "dog. cat. ambulance."  # if None, defaults to the COCO label list

video_detector = detector.VideoDetector(device="cpu")  # or "cuda"
results = video_detector.detect_objects(video_tensor,
                                        text_queries=text_queries,
                                        text_threshold=0.3,
                                        model_id="IDEA-Research/grounding-dino-tiny")
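For reference, the underlying Hugging Face pipeline applied to a single frame looks roughly like the sketch below. It is illustrative rather than definitive; the frame path is a placeholder and post-processing parameter names have shifted slightly between transformers versions.

# Zero-shot object detection on one frame with Grounding DINO via transformers
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("frame.jpg")          # one extracted video frame (placeholder path)
text_queries = "dog. cat. ambulance."    # lower-case labels separated by full stops

inputs = processor(images=image, text=text_queries, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])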

Data Augmentation

Things like video transformers are incredibly powerful and can be used to create great new models. However, they often require a huge amount of data, which isn't necessarily easy to come by for video. In those cases, we need a way to generate varied data that stops our models overfitting. Data augmentation is one such solution to help make the most of limited data.

For video, there are many standard methods for augmenting the data, and most of these are supported by the major frameworks. Vid-prepper brings together two of the best – Kornia and Torchvision. With vid-prepper, you can perform individual augmentations like cropping, flipping, mirroring, padding, Gaussian blurring, adjusting brightness, colour, saturation and contrast, and coarse dropout (where parts of the video frame are masked). You can also chain them together for greater efficiency.

Augmentations all work on the video tensors rather than directly on the videos, and support GPU acceleration if you have it. The example code below shows how to call the methods individually and how to chain them.

# Individual augmentation example
from vid_prepper import augmentor

video_path = "video.mp4"
video_loader = loader.VideoLoader(num_frames=16,
                                  frame_stride=2,
                                  size=(224, 224),
                                  device="cpu",
                                  use_nvdec=False)  # Use "cuda" and NVDEC if available
video_tensor = video_loader.load_file(video_path)

video_augmentor = augmentor.VideoAugmentor(device="cpu", use_gpu=False)
cropped = video_augmentor.crop(video_tensor, type="center", size=(200, 200))
flipped = video_augmentor.flip(video_tensor, type="horizontal")
brightened = video_augmentor.brightness(video_tensor, amount=0.2)


# Chained augmentations
augmentations = [
    ('crop', {'type': 'random', 'size': (180, 180)}),
    ('flip', {'type': 'horizontal'}),
    ('brightness', {'amount': 0.1}),
    ('contrast', {'amount': 0.1})
]

chained_result = video_augmentor.chain(video_tensor, augmentations)
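For comparison, a similar chained pipeline written directly against Kornia (one of the libraries vid-prepper wraps) might look like the sketch below. It is not vid-prepper's implementation, and it assumes a clip tensor shaped (batch, frames, channels, height, width) with values in [0, 1].

# Chained video augmentations written directly with Kornia
import torch
import kornia.augmentation as K

video_aug = K.VideoSequential(
    K.RandomCrop((180, 180)),
    K.RandomHorizontalFlip(p=1.0),
    K.ColorJitter(brightness=0.1, contrast=0.1),
    data_format="BTCHW",
    same_on_frame=True,  # apply identical parameters to every frame of a clip
)

clip = torch.rand(1, 16, 3, 224, 224)  # dummy clip for illustration
augmented = video_aug(clip)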

Summing Up

Video preprocessing is hugely important in deep learning because of the relatively huge size of the data compared to text. Transformer models' appetite for oceans of data compounds this even further. Three key elements make up the deep learning process – time, money and performance. By optimizing our input video data, we can minimize the amount of the first two we need to get the best out of the final one.

There are some amazing open source tools available for video machine learning, with more coming along every day. Vid-prepper stands on the shoulders of some of the best and most widely used, in an attempt to bring them together in an easy-to-use package. Hopefully you find some value in it and it helps you create the next generation of video models, which is incredibly exciting!
