CinePile 2.0 – making stronger datasets with adversarial refinement




In this blog post we share the journey of releasing CinePile 2.0, a significantly improved version of our long video QA dataset. The improvements in the new dataset rely on a new approach that we coined adversarial dataset refinement.

We’re excited to share both CinePile 2.0 and our implementation of the adversarial refinement method, which we believe can strengthen many existing datasets and become a direct part of future dataset creation pipelines.

Adversarial Refinement Pipeline

If you are mainly interested in the adversarial refinement method, you can jump directly to the Adversarial Refinement section.



Wait. What’s CinePile?

In May 2024, we launched CinePile, a long video QA dataset with about 300,000 training samples and 5,000 test samples.

The first release stood out from other datasets in two aspects:

  • Question diversity: It covers temporal understanding, plot analysis, character dynamics, setting, and themes.
  • Question difficulty: In our benchmark, humans outperformed the best commercial vision models by 25% and open-source ones by 65%.



Taking a look at a data sample

Part of the secret sauce behind it is that it relies on movie clips from YouTube and Q&As distilled from actual audio descriptions designed for visually impaired audiences. These descriptions offer rich context beyond basic visuals (e.g., “What color is the car?”), helping us create more complex questions.



Tell me more. How did you put together the original dataset?

To automate question creation, we first built question templates by inspecting existing datasets like MovieQA and TVQA. We clustered the questions in these datasets using the textual similarity model WhereIsAI/UAE-Large-V1 and then prompted GPT-4 with 10 random examples from each cluster to generate a question template and a prototypical question for each (a minimal sketch of this clustering step follows the table):

| Category | Question template | Prototypical question |
| --- | --- | --- |
| Character and Relationship Dynamics (CRD) | Interpersonal Dynamics | What changes occur in the relationship between person A and person B following a shared experience or actions? |
| Character and Relationship Dynamics (CRD) | Decision Justification | What reasons did the character give for making their decision? |
| Narrative and Plot Analysis (NPA) | Crisis Event | What major event leads to the character’s drastic action? |
| Narrative and Plot Analysis (NPA) | Mysteries Unveiled | What secret does character A reveal about event B? |
| Setting and Technical Analysis (STA) | Physical Possessions | What is [Character Name] holding? |
| Setting and Technical Analysis (STA) | Environmental Details | What does the [setting/location] look like [during/at] [specific time/place/event]? |
| Temporal (TEMP) | Critical Time-Sensitive Actions | What must [Character] do quickly, and what are the consequences otherwise? |
| Temporal (TEMP) | Frequency | How many times does a character attempt [action A]? |
| Thematic Exploration (TH) | Symbolism and Motif Tracking | Are there any symbols or motifs introduced in Scene A that reappear or evolve in Scene B, and what do they signify? |
| Thematic Exploration (TH) | Thematic Parallels | What does the chaos in the scene parallel in terms of the movie’s themes? |
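
The clustering step can be reproduced with off-the-shelf tools. Below is a minimal sketch, assuming the `sentence-transformers` and `scikit-learn` packages; the `questions` list, cluster count, and sampling logic are illustrative, not taken from our pipeline.

```python
# Sketch: cluster existing QA-dataset questions by semantic similarity,
# then sample a few per cluster to seed template generation with GPT-4.
import random
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

questions = [
    "What is Sarah holding when she enters the kitchen?",
    "Why does Tom decide to leave the party early?",
    "How many times does the phone ring before anyone answers?",
    # ... questions collected from MovieQA, TVQA, etc.
]

# Embed questions with the same textual-similarity model named in the post.
encoder = SentenceTransformer("WhereIsAI/UAE-Large-V1")
embeddings = encoder.encode(questions, normalize_embeddings=True)

# Group semantically similar questions; the cluster count here is arbitrary.
labels = KMeans(n_clusters=3, random_state=0).fit_predict(embeddings)

clusters = defaultdict(list)
for question, label in zip(questions, labels):
    clusters[label].append(question)

# For each cluster, sample up to 10 questions to show GPT-4 when asking it
# to write a question template and a prototypical question.
for label, members in clusters.items():
    examples = random.sample(members, k=min(10, len(members)))
    print(f"Cluster {label}: {examples[:2]} ...")
```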

Since templates aren’t always relevant to every movie clip, we used Gemini 1.0 Pro to select the most appropriate ones for each scene. Next, we fed a language model the scene’s text, the selected template names (e.g., “Physical Possessions”), sample questions, and a system prompt to create scene-specific questions. A well-designed prompt helps the model focus on the entire scene, generating deeper questions while avoiding superficial ones. We found that:

  • Providing prototypical examples and including timestamps for dialogues and visual descriptions prevents GPT-4 from hallucinating
  • This approach leads to more plausible multiple-choice question (MCQ) distractors
  • Asking the model to provide a rationale for its answers improves the quality of the questions

Using this approach, we generated roughly 32 questions per video.
Prior to releasing CinePile, we implemented several mechanisms to ensure the quality of the dataset/benchmark, which we cover in the next section.
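
As an illustration of the generation step, here is a minimal sketch of how the scene text, selected template names, and prototypical questions might be assembled into a single prompt; the prompt wording, helper names, and `openai` client usage are assumptions of this sketch, not the exact prompts from our pipeline (those are in the repository).

```python
# Sketch: build a scene-specific question-generation prompt.
# The wording below is illustrative; the real prompts live in the CinePile repo.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_prompt(scene_text: str, templates: list[dict]) -> str:
    # Each template carries a name and a prototypical question, as in the table above.
    template_block = "\n".join(
        f"- {t['name']}: e.g. \"{t['prototype']}\"" for t in templates
    )
    return (
        "You are given a movie scene with timestamped dialogue and visual descriptions.\n"
        "Write multiple-choice questions grounded in the WHOLE scene, avoiding "
        "superficial questions. Use these selected templates:\n"
        f"{template_block}\n\n"
        "For every question, give 5 answer choices, mark the correct one, and "
        "provide a short rationale for the answer.\n\n"
        f"SCENE:\n{scene_text}"
    )


scene_text = "[00:01] Visual: A man runs through the rain...\n[00:04] MAN: Hold the door!"
templates = [
    {"name": "Physical Possessions", "prototype": "What is [Character Name] holding?"},
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": build_prompt(scene_text, templates)}],
)
print(response.choices[0].message.content)
```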



Inspecting the quality of the first results

While our process typically generates well-formed, answerable questions, some turn out to be trivial or rely on basic concepts that don’t require watching the clip. To address this, we used several large language models (LLMs) to identify and filter three types of issues (a sketch of the degeneracy check follows the list):

  1. Degeneracy Issues

    • A question is considered “degenerate” if its answer is apparent from the question itself (e.g., “What is the color of the pink house?”)
    • These comprised only a small portion of our dataset
    • Since manual review wasn’t feasible at our scale, we employed three LLMs—Gemini, GPT-3.5, and Phi-1.5—for automated detection
    • Questions were excluded from the evaluation set if all three models answered correctly without any context
  2. Vision Reliance Issues

    • Some multiple-choice questions could be answered using dialogue alone, without requiring visual information
    • We used the Gemini model to determine whether questions could be answered using only dialogue
    • Questions received a binary score: 0 if answerable without visuals, 1 if visual information was required
  3. Difficulty Assessment

    • To evaluate question difficulty, we tested whether models could answer correctly even when given full context (both visual descriptions and subtitles)
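
To make the degeneracy criterion concrete, here is a minimal sketch under the assumption of a generic `ask_llm(model, prompt)` helper (a hypothetical wrapper, not part of our codebase): a question is dropped from the evaluation set only when every detector model answers it correctly without any scene context.

```python
# Sketch: degeneracy check with three "context-free" models.
# `ask_llm` is a hypothetical helper that sends a prompt to the named model
# and returns the letter of the chosen answer.

DETECTORS = ["gemini-pro", "gpt-3.5-turbo", "phi-1.5"]


def is_degenerate(question: str, choices: list[str], answer_idx: int, ask_llm) -> bool:
    """A question is degenerate if *all* detector models answer it correctly
    from the question and answer choices alone (no video, no subtitles)."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    prompt = (
        "Answer the following multiple-choice question using ONLY the text below.\n"
        f"{question}\n{options}\nReply with a single letter."
    )
    correct = chr(65 + answer_idx)
    return all(
        ask_llm(model, prompt).strip().upper().startswith(correct)
        for model in DETECTORS
    )
```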

Through continued use of the benchmark by our team and the broader community, we identified several areas for improvement that drove us to consider CinePile 2.0.



CinePile 2.0

For CinePile’s second release, we worked together with Hugging Face (following their successful experimentation with fine-tuning Video Llava 7B on CinePile) to identify and prioritize several areas of improvement.



Issues in CinePile 1.0

While the degeneracy filtering was useful in CinePile 1.0, it had several limitations:

  • Some questions could be answered using just the Q&A pairs, without requiring transcripts or visual content
  • Many flagged questions contained valuable insights from the video – rather than discarding them, they could have been rephrased to better capture their value
  • Degeneracy checks were limited to the test set: running multiple models—especially proprietary ones—was too expensive at scale for CinePile 1.0’s training set

To address these issues, we introduced a new Adversarial Refinement pipeline that helps improve weak questions rather than simply discarding them, and that can be applied at scale much more easily. Throughout this post, we’ll refer to the model(s) that identify degenerate questions (using only the question and answer choices, without visual or dialogue information) as the “Deaf-Blind LLM.”



Adversarial Refinement

Adversarial Refinement Pipeline

The Adversarial Refinement pipeline aims to modify questions or answers until a Deaf-Blind LLM cannot easily predict the correct answer. Here’s how it works (a minimal sketch follows the list):

  1. The Deaf-Blind LLM provides both an answer and a rationale explaining its choice based solely on the question
  2. These rationales help identify implicit cues or biases embedded in the question
  3. Our question-generation model uses this rationale to modify the question and/or answer choices, removing the implicit clues
  4. This process repeats up to five times per question until the Deaf-Blind LLM’s performance drops to random chance
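
The loop itself is small. Below is a minimal sketch, assuming hypothetical `deaf_blind_answer`, `rewrite_question`, and `is_still_degenerate` helpers that wrap the Deaf-Blind LLM, the question-generation model, and the chance-corrected check; the real prompts and model calls are in the repository.

```python
# Sketch: adversarial refinement loop (up to 5 rounds per question).
# `deaf_blind_answer` returns (chosen_index, rationale) from the question text alone;
# `rewrite_question` asks the question-generation model to remove the leaked cues.
MAX_ROUNDS = 5


def refine(question, choices, answer_idx,
           deaf_blind_answer, rewrite_question, is_still_degenerate):
    for _ in range(MAX_ROUNDS):
        # Stop as soon as the Deaf-Blind LLM can no longer beat random chance.
        if not is_still_degenerate(question, choices, answer_idx):
            return question, choices, answer_idx, True
        # Otherwise use its rationale to find and remove the implicit giveaway.
        _, rationale = deaf_blind_answer(question, choices)
        question, choices, answer_idx = rewrite_question(
            question, choices, answer_idx, rationale
        )
    # Could not fix the question within 5 rounds: flag it for manual review.
    return question, choices, answer_idx, False
```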

Given the computational demands of this iterative process, we wanted a powerful yet accessible LLM that could run locally to avoid API usage limits, delays, and cloud service costs. We chose:

  • LLaMA 3.1 70B (open-source model) as the Deaf-Blind LLM
  • GPT-4 for generating question modifications

To account for random chance, we did the following (see the sketch after this list):

  • Tested all five permutations of the answer choice order
  • Marked a question as degenerate if the model answered correctly in three out of five attempts
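
Concretely, the chance-correction step might look like the sketch below, reusing the hypothetical `deaf_blind_answer` helper from the previous sketch. Cyclic rotations are used here as one simple way to obtain five distinct orderings of the five choices; the exact permutation scheme in the real pipeline may differ.

```python
# Sketch: chance-corrected degeneracy test over five answer orderings.
def is_degenerate_across_orderings(question, choices, answer_idx, deaf_blind_answer):
    n = len(choices)  # CinePile questions have 5 answer choices
    correct_count = 0
    for shift in range(n):
        # Rotate the answer choices and track where the correct one lands.
        order = [(i + shift) % n for i in range(n)]
        shuffled = [choices[i] for i in order]
        new_answer_idx = order.index(answer_idx)
        chosen_idx, _ = deaf_blind_answer(question, shuffled)
        correct_count += int(chosen_idx == new_answer_idx)
    # Degenerate if the Deaf-Blind LLM is right in at least 3 of the 5 attempts.
    return correct_count >= 3
```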



Results of the adversarial refinement

In short, this was the impact of running adversarial refinement on CinePile:

  • Successfully modified 90.24% of degenerate Q&A pairs in the test set
  • Manually reviewed unfixable Q&A pairs (~80 out of 800)
    • Modified when possible
    • Otherwise excluded from evaluation split
  • Corrected 90.94% of weak pairs in the training set
    • Retained unfixable ones as they do not negatively impact performance



Implementation

In this release, we’re publishing both our adversarial refinement pipeline and the code for identifying weak questions. The complete implementation, including all prompts, is available in our public repository.



Evaluations

After testing both previously evaluated models and 16 new Video-LLMs on the modified test set, we’ve highlighted the top performers in the figure below. Here’s what the results show:

  • Gemini 1.5 Pro led among commercial Vision Language Models (VLMs)

    • Excelled particularly in “Setting and Technical Analysis”
    • Best performance on visually driven questions about movie environments and character interactions
  • GPT-based models showed competitive performance

    • Strong in “Narrative and Plot Analysis”
    • Performed well on questions about storylines and character interactions
  • Gemini 1.5 Flash, a lighter version of Gemini 1.5 Pro

    • Achieved 58.75% overall accuracy
    • Performed particularly well in “Setting and Technical Analysis”

Model Evaluations



Open Source models

The open-source video-LLM community has made significant progress from the first to the current release of CinePile. This is what we learned:



Hard Split

The results on CinePile’s hard split clearly show that current models still lag far behind human capability in understanding visual narratives and story elements. This gap highlights the value of CinePile’s new release as a benchmark for measuring progress toward more sophisticated visual understanding.

Model Evaluations



Leaderboard

We have launched a new CinePile Leaderboard that will be continuously updated as new models emerge. Visit the space to learn how to submit your own models for evaluation.


