FineVideo: behind the scenes




FineVideo logo

Open video datasets are scarce and are therefore slowing down the development of open-source video AI. For that reason we built FineVideo, a dataset with 43k videos that span 3.4k hours and are annotated with rich descriptions, narrative details, scene splits, and QA pairs.

FineVideo contains a highly diverse collection of videos and metadata, which makes it a good ingredient to train models to understand video content, to train diffusion models to generate videos from a text description, or to train computer vision models using its structured data as input.

Wait, you haven't seen FineVideo yet? Take a look at it through the dataset explorer page.




FineVideo Explorer






About this blog post

In this blog post, we share the technical details and code involved in developing FineVideo: a journey that starts with 1.9M videos in YouTube-Commons and ends with 43K videos with all details annotated.

A good way to start is to take a look at the different steps of our journey: content filtering, annotation, and output structuring.



Dataset Creation
FineVideo video filtering and annotation pipeline

In the following sections we discuss each of the steps and provide references to the relevant parts of the code. If you prefer to navigate the code directly, take a look at our FineVideo repository on GitHub.

First, let's take a look at how we got an initial list of YouTube videos and how we applied some first filters.



Building the Raw dataset

Our journey starts in YouTube-Commons: a collection of audio transcripts of videos shared on YouTube under a CC-By license. The project was created and is currently maintained by PleIAs as part of their corpus collection projects.



Filtering YouTube-Commons

YouTube-Commons contains videos and transcripts in a diverse set of languages; our initial task is to narrow down its content to a single language.

We filter YouTube-Commons for videos in English and at the same time gather relevant metadata. From this initial filtering, we collect 1.9M videos, their closed captions, and metadata.

Below are some details on the filters and metadata fields that we keep:

Filters

Field | Filter value | Description
original_language | en | videos in English
transcription_language | en | transcripts in English

Metadata fields

Field | Description
acodec | audio codec
age_limit | YouTube age restrictions for the video
categories | YouTube video category
channel | YouTube channel
channel_follower_count | Number of users subscribed to the channel
channel_id | YouTube channel identifier
character_count | Number of characters in the closed caption
comment_count | Number of comments on YouTube
description | YouTube video description
duration_string | Video duration in hh:mm:ss format
license | Video license
like_count | Number of video likes on YouTube
resolution | Pixel resolution of the video in the format Width x Height
tags | YouTube free text tags associated with the video
text | Closed captions
title | YouTube video title
upload_date | YouTube upload date
vcodec | Video codec
video_id | YouTube video identifier
view_count | Number of views on YouTube
word_count | Number of words in the closed caption

The code for content filtering and metadata gathering is available here [link].
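
As a rough illustration, the language filter boils down to something like the following minimal sketch. It assumes the YouTube-Commons dataset is loaded with the datasets library and exposes the fields from the tables above; it is not the exact pipeline code, which is linked above.

from datasets import load_dataset

# Stream YouTube-Commons and keep only rows where both the video and the
# transcript are in English (see the filter table above).
yt_commons = load_dataset("PleIAs/YouTube-Commons", split="train", streaming=True)

english_only = yt_commons.filter(
    lambda row: row["original_language"] == "en"
    and row["transcription_language"] == "en"
)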



Downloading the videos

Once we had a target list of 1.9M videos, we managed to successfully download 1.8M of them (some of the videos were removed by the channel owners and some had their permissions modified).

We explored two different approaches for distributed downloading.

Option 1: Video2dataset

video2dataset is an open-source project [link] that focuses on distributed video download, transformation, and packaging in different dataset formats. The project natively supports the Slurm Workload Manager and therefore we could run it on our CPU cluster.



Source: Video2Dataset GitHub page

As all our cluster instances face the internet with the same public IP, we contributed to the project the possibility to specify a proxy to facilitate video downloads. While the feature is not merged yet, you can patch video2dataset with our PR [link] to use the proxy capabilities.

Option 2: Cloud batch jobs

Most cloud providers offer the possibility to run batch jobs by simply defining the type of instance that will execute each job, defining a queue, and providing a container with the code that will be executed.

We used Google Cloud and AWS to run a custom docker container that downloads videos and metadata with yt-dlp and pushes the results to S3.

The files to build the Docker container can be found here [code].
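
As a hedged sketch of the kind of work each batch job does (the bucket name, format options, and key layout below are assumptions, not the exact container code):

import json
import boto3
import yt_dlp

def download_and_upload(video_id: str, bucket: str = "finevideo-raw") -> None:
    # Download one video plus its metadata with yt-dlp, then push both to S3
    url = f"https://www.youtube.com/watch?v={video_id}"
    opts = {"format": "mp4", "outtmpl": f"/tmp/{video_id}.mp4", "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.sanitize_info(ydl.extract_info(url, download=True))

    s3 = boto3.client("s3")
    s3.upload_file(f"/tmp/{video_id}.mp4", bucket, f"videos/{video_id}.mp4")
    s3.put_object(
        Bucket=bucket,
        Key=f"metadata/{video_id}.json",
        Body=json.dumps(info),
    )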

Our conclusion

While video2dataset was functional with a proxy and allowed us to do additional processing steps, the number of requests per second we could make through the proxy became a bottleneck. This made us pivot towards cloud batch jobs.



Keeping dynamic content

In our search for the best videos, we narrowed down our selection to content where there is both visual motion and people speaking at a mid-fast pace. We achieve this with word density filtering and visual dynamism filtering.



Word density filtering

We took the density of words in the video as a proxy for audio dynamism. We define word density as:

Word density = Number of words in closed captions / Total video length in seconds

By sampling and visually evaluating the quality of the content at different density thresholds, we decided to remove all videos with a word density lower than 0.5 words/second.

Examples:

Word density | Example
0.25 | (video embed)
0.5 | (video embed)
0.75 | (video embed)
1.0 | (video embed)

The code to filter by word density and explore examples can be found here [link].
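
The filter itself is only a few lines; here is a minimal sketch of the idea described above (the function signature and field handling are assumptions based on the metadata listed earlier):

WORD_DENSITY_THRESHOLD = 0.5  # words per second, chosen by visual inspection

def word_density(closed_caption: str, duration_seconds: float) -> float:
    # Number of words in the closed captions divided by the video length in seconds
    return len(closed_caption.split()) / max(duration_seconds, 1.0)

def keep_video(closed_caption: str, duration_seconds: float) -> bool:
    return word_density(closed_caption, duration_seconds) >= WORD_DENSITY_THRESHOLD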



Visual dynamism filtering

We repurposed FFMPEG's freezedetect filter to evaluate the dynamism of the video. While this filter is designed to identify frozen sections of a video (multiple identical frames placed one after the other), we could also identify chunks with low movement by raising the noise parameter to a very high value.

Rather than running freezedetect across the full video, we analyzed the video in temporal segments and voted on whether the video was static based on the number of segments categorized as static. Through manual evaluation we set a threshold to discard the video if 40% of the analyzed segments have low movement.

Some types of content discarded after this filtering:

Type | Example
Static image with music | (video embed)
Presentation screencast | (video embed)
Highly static people talking to camera | (video embed)

The Dockerfile and code to classify videos by their dynamism can be found here [link].
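
A minimal sketch of the freezedetect idea, simplified to a whole-video pass rather than the per-segment voting described above (the noise tolerance and thresholds are illustrative, not the exact pipeline values):

import re
import subprocess

def frozen_seconds(video_path: str, noise: str = "-25dB", min_freeze: float = 2.0) -> float:
    # Run ffmpeg with an exaggerated noise tolerance so low-movement chunks are
    # reported as "frozen", then sum the freeze durations reported in the log.
    cmd = [
        "ffmpeg", "-hide_banner", "-i", video_path,
        "-vf", f"freezedetect=n={noise}:d={min_freeze}",
        "-an", "-f", "null", "-",
    ]
    log = subprocess.run(cmd, capture_output=True, text=True).stderr
    return sum(float(d) for d in re.findall(r"freeze_duration:\s*([0-9.]+)", log))

def is_static(video_path: str, video_length_s: float, threshold: float = 0.4) -> bool:
    # Discard the video if roughly 40% or more of it is flagged as low movement
    return frozen_seconds(video_path) / video_length_s >= threshold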

From the 1.8M videos analyzed, after this step we keep 600K dynamic videos. At this stage, we dig deeper into the content of the videos, which is key to ensuring diversity in the dataset.



Video Categorization

In order to achieve the most diverse content selection, we categorized the 600K filtered assets using the closed captions and YouTube metadata. To gain control over the categorization, we created a taxonomy and guided the annotation process to adhere to it.



Custom-built Taxonomy

We bootstrapped the custom-built taxonomy using GPT4-o, and a data scientist reviewed and adjusted it. The taxonomy contains 126 fine categories aggregated in multiple levels. This multi-level approach allows users of FineVideo to slice the dataset to fit their particular use case.

taxonomy

The taxonomy is also available in JSON [link].

With an initial version of the taxonomy we started content annotation, and by looking at the annotation results, with the help of a data scientist, we adjusted the taxonomy accordingly.



Content annotation

We categorized the videos using Llama 3.1 70B served through Text Generation Inference (TGI) [code].

The prompt required multiple iterations to ensure the answer is strictly a category in our taxonomy. During our prompt evaluation we learned that by removing the existing YouTube tags and categories from the prompt, the quality of our results increased drastically: YouTube metadata was biasing the text generated by Llama 3.1 towards one of the categories provided by YouTube.

prompt_template = """
Given those categories: {leaves}
Classify a youtube video given its closed captioning and some metadata details. RETURN ONLY the selected category and nothing else!
Title: {title}
Description: {description}
Channel: {channel}
Closed Caption: {closed_caption}
"""



Feedback loop taxonomy – content annotation



Categorization feedback loop
Taxonomy adjustments during content categorization

One of the roles of a data scientist is to curate taxonomies over time, adding new categories or extra degrees of differentiation when needed.

Using LLMs to categorize content shortens the taxonomy adjustment cycle from months or years to hours. Moreover, in some cases, we created categories specifically to discard sensitive videos such as those falling under Firearms & Weapons and Substance Use & Drugs.



Contributing descriptive metadata

At this point of the process, we have three sources of video-level metadata:

  • video category (inferred with Llama 3.1)
  • YouTube Metadata (title, description)
  • Transcripts from YouTube-Commons

In order to contribute to the field of video understanding, we decided to go deeper into timecode-level metadata, for example activities, objects, narrative, and editing aspects.
While we considered human annotation as part of an active learning setup where one or more models propose annotations and a human does a QA step, as we'll discuss in the next sections, we found in Gemini a good solution, especially once we constrained the input video length and the output format.



Long videos & Gemini 1.5 Pro

We dug deeper into Gemini 1.5 Pro, iterating on our prompt and testing it with different content lengths.

Given its limit of 1M tokens, which is roughly equivalent to ~1 hour of video, we were forced to drop videos longer than 1 hour.
An idea to overcome this situation was to accelerate videos longer than one hour so that they would fit in Gemini's context.



Gemini context
Exploration: accelerating videos to fit more content in Gemini's context

While it seemed to work at a high level, when we started looking at the details we realized that only the first minutes of the video were accurately annotated.

Finding that quality drops on long videos made us wonder: is this an issue impacting the rest of our videos? By sampling videos of different lengths and inspecting the video coverage of the annotations, we found a reduction in quality for videos longer than 10 minutes.

Aligned with our goal of bringing high-quality data back to the community, we dropped videos longer than 10 minutes.



Content selection

Given that each hour of video costs more than $5 to annotate with Gemini, we couldn't annotate all the videos we had after filtering. Therefore, we wanted to make sure we had good coverage over all topics, and we sought a good compromise between content diversity for late-pre-training / fine-tuning tasks and budget. We set this size constraint to 4K hours of video.

In order to go from 600K videos to 4K hours of content, we prepared an algorithm that balances content categories, user engagement, and channel representation to reach the targeted duration.

Oracle Flow

Algorithm flow diagram

Some key parts of the content selection algorithm:

  • Activity Score: We calculate an engagement metric for each video by combining comment, view, and like counts with weighted importance. This score helps prioritize videos that have resonated well with viewers.
  • Video Selection: This step iteratively selects videos to meet the target duration while ensuring diversity. It balances high-engagement content with representation from various categories and channels, using a penalty system to avoid overrepresentation of any single channel.
  • Final Adjustment: We adjust the selection to match the target duration as closely as possible without exceeding it. It sorts the selected videos by duration and adds them to the final list until reaching the closest possible total duration to the target.

The code can be found in the repository https://huggingface.co/blog/fine-video.
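
As an illustration of the idea (the weights, field names, and per-channel cap below are assumptions, and the per-category balancing is omitted for brevity), the selection can be sketched as:

from collections import defaultdict

TARGET_HOURS = 4_000
MAX_PER_CHANNEL = 5  # assumed cap standing in for the channel penalty system
WEIGHTS = {"like_count": 0.5, "comment_count": 0.3, "view_count": 0.2}  # assumed weights

def activity_score(video: dict) -> float:
    # Engagement metric combining likes, comments and views with weighted importance
    return sum(weight * video[field] for field, weight in WEIGHTS.items())

def select_videos(videos: list[dict]) -> list[dict]:
    target_seconds = TARGET_HOURS * 3600
    per_channel = defaultdict(int)
    selected, total = [], 0.0
    # Greedy pass over videos sorted by engagement
    for video in sorted(videos, key=activity_score, reverse=True):
        if per_channel[video["channel_id"]] >= MAX_PER_CHANNEL:
            continue  # avoid over-representing a single channel
        if total + video["duration_s"] > target_seconds:
            continue  # never exceed the 4K-hour target duration
        selected.append(video)
        per_channel[video["channel_id"]] += 1
        total += video["duration_s"]
    return selected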



Annotating with Gemini 1.5 Pro and Structured Output with GPT4o

Why structured data?

One of our goals with FineVideo is to provide structured data as a way to empower our community: if you are working on multimodal LLMs, you can slice the data and decide which categories fit your pre-training or fine-tuning mix. If you are more into computer vision, you can directly use the dataset to train classifiers based on the numerical categories included in FineVideo, such as the dynamism score, scene boundaries, or the audio/video correlation score.

Structured data and Gemini 1.5

Gemini 1.5 Pro allows JSON-based outputs by providing a schema. We explored this feature and quickly realized two issues:

  • We couldn’t fit our original schema into Gemini because our schema is extremely complex
  • When we tried slightly simpler schemas (still quite complex), the quality of the Gemini results dropped substantially: most of the scene-level data (characters, activities, props) was dropped. We tried splitting the prompt into multiple prompts and matching different prompts to different parts of the schema, without much success.

What we observed completely matched what other researchers experienced: adding concrete schema constraints can decrease performance. (Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models).

Our solution relied on generating free text with Gemini 1.5 and adding a second processing step to align Gemini's results with our schema.

The Gemini prompt that we used is the following:

Study the video and provide the following details about the video and the semantic scenes that compose it:

- characterList: a list of characters that appear in the whole video and a visual description that should allow me to identify them just seeing an image of them.
- scenes: a list of the scenes with the following properties:
  - start/end timestamps of the scene
  - list of all the characters that appear in the scene
  - list of all activities and their timestamps
  - list of all props and their timestamps
  - list of all video editing details and their start/end timestamps. Details include transitions, effects, music as well as suggestions like segments of the scene that could be removed and why
  - scene mood with notes on how the visuals, audio and context contribute to it. Use the following taxonomy returning only the name in your answer {"moods":{"Positive":[{"name":"Happy","description":"Feeling joyful, content, or delighted."},{"name":"Excited","description":"Feeling enthusiastic, energetic, or eager."},{"name":"Calm","description":"Feeling peaceful, relaxed, or serene."},{"name":"Grateful","description":"Feeling appreciative or thankful."},{"name":"Proud","description":"Feeling satisfied with one's achievements or the achievements of others."}],"Negative":[{"name":"Sad","description":"Feeling down, unhappy, or sorrowful."},{"name":"Angry","description":"Feeling irritated, frustrated, or furious."},{"name":"Anxious","description":"Feeling nervous, worried, or uneasy."},{"name":"Lonely","description":"Feeling isolated, disconnected, or abandoned."},{"name":"Bored","description":"Feeling uninterested, disengaged, or restless."}],"Neutral":[{"name":"Indifferent","description":"Feeling neither particularly positive nor negative."},{"name":"Content","description":"Feeling satisfied but not overly excited."},{"name":"Curious","description":"Feeling interested or inquisitive without strong emotion."},{"name":"Confused","description":"Feeling uncertain or unclear but without strong negative feelings."},{"name":"Pensive","description":"Feeling thoughtful or reflective without strong emotional engagement."}]}}
    - specific mood changing moments inside the scene, report the timestamp and what we transition from/to in any of the dimensions (visual / auditive)
  - scene narrative progression and plot development
    - specific narrative moments inside the scene. Report the timestamp and what happened
  - character interaction and dynamics descriptions and their start/end timestamps
  - specific thematic elements and descriptions
  - specific relevant happenings to create deeper meanings and subtexts not explicitly stated that contribute to the richness and depth of the content, timestamp and descriptions
  - dynamism score of the scene. Score between 0 and 1. 1 is highly dynamic
  - audio - visual correlation score. Score between 0 and 1. 0 means what we see is not correlated with the speech and 1 is highly correlated

- storylines: a list of the different storylines found and which scenes belong to each of them.
  - Specify where the climax is (scene and timestamp) and whether the content is presented as a narrative story, or is more like a collection of facts or non-narrative information
  - if there are scenes not matching storylines, explain how those scenes contribute to the video
- looking at the overall video and the storylines, which segments of the video could be trimmed to make it more dynamic?
- q&a: a list of 5 questions/answers about the video that focus on fine details (objects and or activities), overall story reasoning and mood. Focus on Q&A aspects captured on the audio and the video whenever possible that are difficult to get only by looking at the transcription.
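
A minimal sketch of how a video and this prompt can be sent to Gemini 1.5 Pro through the google-generativeai Python SDK (the file handling and generation settings here are assumptions, not the exact annotation code linked below):

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def annotate(video_path: str, prompt: str) -> str:
    # Upload the video through the File API and wait until it is processed
    video_file = genai.upload_file(path=video_path)
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)

    # Ask Gemini 1.5 Pro for the free-text annotation described by the prompt
    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([video_file, prompt])
    return response.text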

Adding Instructor

Once the result was produced by Gemini, we parsed it with Instructor: a library built on top of Pydantic to achieve structured outputs given a schema. See the table with an example below.

Instructor allowed us to experiment with different models to convert the free text from Gemini into the schema we defined in Pydantic. We tried Gemini and GPT4o, and we stuck with GPT4o given its higher success rate.

Below is an example of the free-text output from Gemini and the corresponding structured output from Instructor for one video:

Gemini Output:
CharacterList: Man. Slim build, brown eyes, shaved sides, black hoodie with colorful logo, black pants. Scenes: Scene 1, Start 0:00, End 0:55. Characters: [Man]. Activities: Introduces bus; describes peaceful location with cows. Props: Bus, cows, deck. Mood: Excited, adventure. Narrative Progression: Introduction to bus; tour begins outside, highlighting nature and relaxation. Dynamism Score 0.7. Audio-Visual Correlation 1.

Instructor Output:

{
  "title": "Bertie the Bus Tour",
  "description": "Guided tour of converted bus.",
  "characterList": [
    {
      "name": "Narrator",
      "description": "Slim build, brown eyes, shaved sides, black hoodie with colorful logo, black pants."
    }
  ],
  "scenes": [
    {
      "sceneId": 1,
      "title": "Introduction to Bus",
      "timestamps": {
        "start": "0:00",
        "end": "0:55"
      },
      "cast": ["Narrator"],
      "activities": [
        "Narrator speaks in front of bus",
        "Shows outdoor deck with chairs, cows nearby."
      ],
      "props": ["Bus", "Deck", "Cows"],
      "mood": "Excited, adventure."
    }
  ],
  "dynamismScore": 0.7,
  "audioVisualCorrelation": 1
}
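
For illustration, a minimal sketch of the Instructor step: the Pydantic schema below is a simplified stand-in for the actual FineVideo schema, and the system prompt is an assumption.

from typing import List
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Scene(BaseModel):
    sceneId: int
    title: str
    cast: List[str]
    activities: List[str]
    props: List[str]
    mood: str

class VideoAnnotation(BaseModel):
    title: str
    description: str
    scenes: List[Scene]
    dynamismScore: float
    audioVisualCorrelation: float

# Patch the OpenAI client so responses are validated against the Pydantic schema
client = instructor.from_openai(OpenAI())

def structure_annotation(gemini_free_text: str) -> VideoAnnotation:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=VideoAnnotation,
        messages=[
            {"role": "system", "content": "Map this video annotation to the given schema."},
            {"role": "user", "content": gemini_free_text},
        ],
    )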
        

It's worth highlighting that Gemini's content filtering dropped some videos, as this is something that can happen to you if you use Gemini. In our case, given the amount of content we were targeting, the total number of minutes dropped by Gemini's filtering was negligible.

The full code to annotate videos can be found here [link].



Fine Alignment and anomaly filtering

With the videos annotated and the data properly aligned to our schema, we look at the temporal domain of the data and ensure its alignment with the video: Gemini 1.5 reads video at 1 frame per second, while videos quite frequently have 25-29 frames per second. In our Fine Alignment we make sure the scene boundaries provided by Gemini 1.5 match the correct frames in the video.

We also use this temporal alignment to discard cases where Gemini stopped providing useful data and part of the video was wrongly annotated. Note that thanks to dropping all content longer than 10 minutes earlier in the pipeline, the number of videos with bad quality data was negligible (lower than 0.5%).



Fine Alignment
Fine metadata – video scene boundary to shot alignment as a mechanism to discard outliers

Link to video alignment code here [link]
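
A minimal sketch of the timestamp-to-frame mapping behind this step (the field names and the coverage threshold are assumptions, not the exact alignment code linked above):

def timestamp_to_seconds(ts: str) -> int:
    # Annotation timestamps come as "m:ss" or "h:mm:ss" on Gemini's 1 fps timeline
    seconds = 0
    for part in ts.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def scene_boundary_frames(scene: dict, fps: float) -> tuple[int, int]:
    # Map scene boundaries to frame indices at the video's real frame rate (25-29 fps)
    start = round(timestamp_to_seconds(scene["timestamps"]["start"]) * fps)
    end = round(timestamp_to_seconds(scene["timestamps"]["end"]) * fps)
    return start, end

def looks_truncated(scenes: list[dict], video_duration_s: float) -> bool:
    # Flag annotations whose scenes stop well before the end of the video,
    # i.e. cases where Gemini stopped producing useful data
    last_end = max(timestamp_to_seconds(s["timestamps"]["end"]) for s in scenes)
    return last_end < 0.8 * video_duration_s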



Future Work

We are currently preparing the training of a multi-modal LLM with FineVideo, and we plan to share the model weights and training recipe with the community as soon as it is completed.

We are also open to other extensions of FineVideo. Speak up and tell us what you would like to see!


