(This post was authored by hlky and Sayak)
Tooling for image generation datasets is well established, with img2dataset being a fundamental tool used for large-scale dataset preparation, complemented by various community guides, scripts, and UIs that cover smaller-scale initiatives.
Our ambition is to make tooling for video generation datasets equally established, by creating open video dataset scripts suited to small-scale use cases and leveraging video2dataset for large-scale ones.
“If I have seen further it is by standing on the shoulders of giants”
In this post, we provide an overview of the tooling we are developing to make it easy for the community to build their own datasets for fine-tuning video generation models. If you cannot wait to get started already, we welcome you to check out the codebase here.
Table of contents
Tooling
Typically, video generation is conditioned on natural language text prompts such as: “A cat walks on the grass, realistic style”. Then, within a video, there are a number of qualitative aspects for controllability and filtering, such as:
- Motion
- Aesthetics
- Presence of watermarks
- Presence of NSFW content
Video generation models are only as good as the data they are trained on. Therefore, these aspects become crucial when curating the datasets for training/fine-tuning.
Our three-stage pipeline draws inspiration from works like Stable Video Diffusion, LTX-Video, and their data pipelines.
Stage 1 (Acquisition)
Like video2dataset, we opt to use yt-dlp for downloading videos.
We provide a script, Video to Scenes, to split long videos into short clips.
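As a rough sketch of this stage, the snippet below downloads a video with yt-dlp and splits it into clips at detected scene boundaries with PySceneDetect. The URL, output names, and detector threshold are placeholders, and the actual Video to Scenes script may work differently.

```python
import yt_dlp
from scenedetect import ContentDetector, detect, split_video_ffmpeg

# Placeholder URL; any source supported by yt-dlp works.
url = "https://www.youtube.com/watch?v=VIDEO_ID"

# Download the source video with yt-dlp.
with yt_dlp.YoutubeDL({"format": "mp4", "outtmpl": "source.mp4"}) as ydl:
    ydl.download([url])

# Detect scene boundaries and split the video into short clips (requires ffmpeg on PATH).
scenes = detect("source.mp4", ContentDetector(threshold=27.0))
split_video_ffmpeg("source.mp4", scenes, output_file_template="clip-$SCENE_NUMBER.mp4")
```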
Stage 2 (Pre-processing/filtering)
- Extracted frames
  - predict a watermark probability (pwatermark)
  - predict an aesthetic score
  - detect NSFW content
- Entire video
  - predict a motion score with OpenCV (sketched below)
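Here is a minimal sketch of what such a motion score could look like, using the mean dense optical-flow magnitude between consecutive frames. The function name, frame sampling, and Farneback parameters are our own choices; the actual filter may compute motion differently.

```python
import cv2
import numpy as np


def motion_score(video_path: str, max_frames: int = 64) -> float:
    """Rough motion score: mean dense optical-flow magnitude between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    prev_gray = None
    magnitudes = []
    while len(magnitudes) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            magnitudes.append(float(np.mean(mag)))
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0


print(motion_score("clip-001.mp4"))  # placeholder clip path
```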
Stage 3 (Processing)
Florence-2: we use microsoft/Florence-2-large to run the Florence-2 tasks `<CAPTION>`, `<DETAILED_CAPTION>`, `<DENSE_REGION_CAPTION>`, and `<OCR_WITH_REGION>` on extracted frames. This provides different captions, object recognition, and OCR that can be used for filtering in various ways.
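A minimal sketch of running these tasks on a single extracted frame with transformers, following the usage pattern from the Florence-2 model card (the frame path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

image = Image.open("frame_0001.png").convert("RGB")  # placeholder path to an extracted frame

for task in ["<CAPTION>", "<DETAILED_CAPTION>", "<DENSE_REGION_CAPTION>", "<OCR_WITH_REGION>"]:
    inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Parse the raw generation into task-specific output (captions, boxes, OCR regions).
    parsed = processor.post_process_generation(
        generated_text, task=task, image_size=(image.width, image.height)
    )
    print(parsed)
```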
We can bring in any other captioner in this regard. We can also caption the entire video (e.g., with a model like Qwen2.5) as opposed to captioning individual frames.
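For whole-video captioning, something along these lines with Qwen2-VL could be used. The checkpoint, prompt, and clip path here are assumptions for illustration, not the exact setup used for our datasets.

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# One user turn containing the video clip (placeholder path) and a captioning instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip-001.mp4"},
            {"type": "text", "text": "Describe this video in one detailed sentence."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding so only the caption remains.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
caption = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(caption)
```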
Filtering examples
In the dataset for the model finetrainers/crush-smol-v0, we opted for captions from Qwen2VL and filtered on pwatermark < 0.1 and aesthetic > 5.5. This highly restrictive filtering resulted in 47 videos out of 1493 total.
Let’s review the example frames for pwatermark:
- Two frames with text have scores of 0.69 and 0.61.
- The “toy car with a bunch of mice in it” scores 0.60, then 0.17 as the toy car is crushed.
All of these example frames were filtered out by pwatermark < 0.1. pwatermark is effective at detecting text/watermarks; however, the score gives no indication of whether it is a text overlay or a toy car’s license plate. Our filtering required all scores to be below the threshold; a mean across frames may be a more effective strategy for pwatermark, with a threshold of around 0.2 – 0.3.
Let’s review the example frames for aesthetic scores:
- The pink castle initially scores 5.5, then 4.44 as it is crushed.
- The action figure scores lower at 4.99, dropping to 4.84 as it is crushed.
- The shard of glass scores low at 4.04.
In our filtering we required all scores to be above the threshold; in this case, using the aesthetic score from the first frame only may be a more effective strategy.
If we review finetrainers/crush-smol, we notice that many of the objects being crushed are round or rectangular and colorful, which is similar to our findings in the example frames. Aesthetic scores can be useful, yet they have a bias that may filter out good data when used with extreme thresholds like > 5.5. They may be more effective as a filter for bad content than for good content, with a minimum threshold of around 4.25 – 4.5.
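To make the comparison concrete, here is a minimal sketch of the strict filter we used versus the relaxed alternatives discussed above, over a hypothetical per-frame metadata table. The file name and column names are assumptions for illustration, not the actual pipeline output.

```python
import pandas as pd

# Hypothetical per-frame metadata: one row per extracted frame with its video id,
# frame index, pwatermark, and aesthetic score from Stage 2.
frames = pd.read_parquet("frame_scores.parquet")

per_video = frames.sort_values("frame_idx").groupby("video_id").agg(
    max_pwatermark=("pwatermark", "max"),
    mean_pwatermark=("pwatermark", "mean"),
    min_aesthetic=("aesthetic", "min"),
    first_aesthetic=("aesthetic", "first"),
)

# Strict filter used for crush-smol-v0: every frame must pass both thresholds.
strict = per_video[(per_video.max_pwatermark < 0.1) & (per_video.min_aesthetic > 5.5)]

# Relaxed alternative: mean pwatermark across frames and first-frame aesthetic only.
relaxed = per_video[(per_video.mean_pwatermark < 0.25) & (per_video.first_aesthetic > 4.5)]

print(len(strict), len(relaxed))
```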
OCR/Caption
Here we provide some visual examples for each filter as well as the captions from Florence-2.
| Image | Caption | Detailed Caption |
|---|---|---|
| (frame) | A toy car with a bunch of mice in it. | The image shows a blue toy car with three white mice sitting in the back of it, driving down a road with a green wall in the background. |
| With OCR labels | With OCR and region labels |
|---|---|
| (frame with OCR labels) | (frame with OCR and region labels) |
Putting this tooling to use 👨‍🍳
We have created various datasets with this tooling in an attempt to generate cool video effects, similar to the Pika Effects:
We then used these datasets to fine-tune the CogVideoX-5B model using finetrainers. Below is an example output from finetrainers/crush-smol-v0:
Your Turn
We hope this tooling gives you a head start to create small and high-quality video datasets for your own custom applications. We will continue to add more useful filters to the repository, so please keep an eye out. Your contributions are also more than welcome 🤗
Thanks to Pedro Cuenca for his extensive reviews on the post.