How Long Can Your Video Large Multimodal Model Go?



TimeScope is an open-source benchmark designed to measure how well vision-language models understand long videos. By splicing short “needle” clips into videos ranging from one minute to eight hours, it evaluates three skills:

  • localized retrieval,
  • information synthesis,
  • fine-grained temporal perception.

TimeScope reveals that many state-of-the-art models still struggle with true temporal comprehension.




Recent advances in multimodal AI have produced models that claim to grasp hour-long videos. This trend mirrors progress in long-context language models, which excel at reasoning over lengthy text. Following suit, vision-language systems now advertise context windows that can handle hundreds of frames. But these claims deserve a closer look: do these models truly understand the sequence of events, or are they limited to surface-level retrieval? It’s worth asking whether their capabilities are being overstated.

Text benchmarks such as HELM and RULER have exposed the fragility of long-context claims, showing that models often struggle when tasks demand more than simple retrieval, such as reasoning or aggregation over long contexts. In the video domain, however, we’re still playing catch-up. The most common test, Video Needle in a Haystack (VideoNIAH), injects static images as “needles” into videos, effectively measuring visual search rather than true temporal dynamics. As a result, even top-tier models advertising massive frame capacities are rarely trained beyond ~256 frames and see sharp drops on benchmarks like Video-MME when pushed further.

This measurement gap leaves us wondering: what does it really mean for a model to “understand” long videos? To address this, we’re excited to introduce TimeScope, a new open-source benchmark hosted on Hugging Face. TimeScope probes the limits of long-video capabilities by inserting several short (~5-10 second) video clips—our “needles”—into base videos ranging from one minute to eight hours. With three distinct task types, it evaluates not only retrieval but also synthesis, localization, and fine-grained motion analysis, providing a more holistic view of temporal comprehension.



Why TimeScope? Motivating a Better Benchmark for Video

The promise of long-video AI is transformative: agents that summarize hours of footage, detect subtle anomalies, and answer complex questions about extended narratives. Integrated into robotics, these models could analyze prolonged operations, adapt in real time, and push autonomous decision-making forward. Just as compelling is the vision of a personal assistant that understands daily life and offers continuous, actionable feedback.

In practice, however, capabilities are often overstated. Models may claim to process 10,000+ frames, but training data often caps at 256 frames per clip, leading to degraded performance on longer inputs. We have seen this in evaluations where increasing the frame sampling rate tanks accuracy on tasks requiring temporal insight.

TimeScope flips the script by emphasizing three pillars of long-video understanding:

  1. Localized Retrieval: Can the model spot and answer questions about a specific short segment within a very long video?
  2. Information Synthesis: Can it gather and order details from multiple points across the timeline?
  3. Fine-Grained Temporal Perception: Can it analyze motion and events in needles that demand dense, multi-frame sampling?



Benchmark Design

TimeScope’s key idea is to use short video clips as “needles”: instead of merely asking models to spot the needle, it pushes them to genuinely understand the whole video. We start with a long base video (e.g., a documentary, lecture, or ambient footage) and insert one or more hand-curated short video needles (5-10 seconds each) at random positions. These needles contain the key information needed to solve the task, forcing models to process the full input rather than rely on shortcuts like sparse sampling. A minimal sketch of this insertion process follows Figure 1 below.

Benchmark Design Diagram

Figure 1: Overview of TimeScope’s needle insertion process. A long base video (1 minute to 8 hours) serves as the haystack, into which we splice short video needles (~5-10 seconds). Tasks require detecting, synthesizing, or analyzing content from these needles, embedded at various depths.
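
To make the splicing concrete, here is a minimal sketch of the idea in Python. It assumes moviepy 1.x, and the file names and insertion policy (one random cut point per needle) are illustrative rather than TimeScope’s exact pipeline.

    # Sketch of needle insertion: splice short clips into a long base video
    # at random timestamps. Assumes moviepy 1.x; paths are illustrative.
    import random
    from moviepy.editor import VideoFileClip, concatenate_videoclips

    def insert_needles(base_path, needle_paths, out_path, seed=0):
        rng = random.Random(seed)
        base = VideoFileClip(base_path)
        needles = [VideoFileClip(p) for p in needle_paths]

        # One random cut point per needle, sorted so needles land in chronological order.
        cut_points = sorted(rng.uniform(0, base.duration) for _ in needles)

        segments, prev = [], 0.0
        for t, needle in zip(cut_points, needles):
            segments.append(base.subclip(prev, t))  # base footage up to the cut
            segments.append(needle)                 # the spliced-in needle
            prev = t
        segments.append(base.subclip(prev, base.duration))

        concatenate_videoclips(segments, method="compose").write_videofile(out_path)

    insert_needles("lecture_3h.mp4", ["needle_axe.mp4", "needle_words.mp4"], "haystack.mp4")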

We evaluate across three needle types, each targeting a distinct aspect of long-video comprehension:



1. Localized Retrieval

This tests basic retrieval and understanding of a localized event. Questions are posed so that sampling a single relevant frame from the needle should suffice, much like asking about a short segment of a much longer video.

Example:
What mode of transportation is shown in the video?
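
For illustration, a localized-retrieval item might pair such a question with the haystack video and the needle’s location, roughly as below; the field names and values are hypothetical and do not reflect TimeScope’s actual dataset schema.

    # Hypothetical shape of a localized-retrieval item. Field names and values
    # are illustrative only, not TimeScope's actual dataset schema.
    example_item = {
        "haystack_video": "haystack.mp4",    # long video with the needle spliced in
        "needle_span_s": (4210.0, 4218.5),   # where the needle sits, in seconds
        "question": "What mode of transportation is shown in the video?",
    }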



2. Information Synthesis

Here, we embed multiple text-based needles (e.g., 2-4 short clips displaying “secret words” via on-screen text) at different points in the video. The model must identify all the words and report them in chronological order, simulating tasks like extracting timestamps or key facts from dispersed scenes. This requires scanning the full timeline and understanding relative positioning.
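
As a rough illustration of what the task demands, the check below counts a prediction as correct only if every secret word is recovered and reported in the right order. This is a sketch of the idea, not TimeScope’s official metric.

    # Illustrative check for the synthesis task: every secret word must be
    # recovered AND reported in chronological order. Not the official metric.
    def synthesis_correct(predicted_words, gold_words_in_order):
        normalize = lambda words: [w.strip().lower() for w in words]
        return normalize(predicted_words) == normalize(gold_words_in_order)

    print(synthesis_correct(["apple", "Tiger", "river"], ["apple", "tiger", "river"]))   # True
    print(synthesis_correct(["tiger", "apple", "river"], ["apple", "tiger", "river"]))   # False: wrong order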



3. Fine-Grained Temporal Perception

These questions focus on motion or sequences within a short clip, so single-frame sampling won’t cut it: the model must perceive dynamics across frames. This probes whether long-context handling preserves temporal fidelity (see the back-of-the-envelope sketch after the example below).

Example:
How many times did the person swing his axe? (a) one (b) two (c) three (d) four (e) five (f) six
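
A quick back-of-the-envelope calculation shows why dense sampling matters here. The 256-frame budget and 10-second needle below are assumed numbers for illustration: with uniform sampling over a long haystack, almost no sampled frames land inside the needle, so counting swings becomes impossible.

    # Back-of-the-envelope: with a fixed frame budget, how many uniformly
    # sampled frames land inside a 10-second needle? Numbers are illustrative.
    def frames_inside_needle(video_seconds, needle_seconds, frame_budget):
        seconds_per_sample = video_seconds / frame_budget
        return int(needle_seconds / seconds_per_sample)

    for label, seconds in [("1 min", 60), ("1 hour", 3600), ("8 hours", 8 * 3600)]:
        n = frames_inside_needle(seconds, needle_seconds=10, frame_budget=256)
        print(f"{label:>7} haystack -> ~{n} sampled frames inside the needle")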

By varying video lengths and needle placements, TimeScope measures how much video a model can really handle, and shows that performance drops as videos get longer.



Evaluations & Leaderboard

To kick things off, we ran TimeScope on a collection of leading vision-language models, from open-source favorites to juggernauts like Gemini 2.5-Pro. The results underscore the benchmark’s value: even models that claim strong long-video support still struggle with genuine long-video tasks. The findings reveal clear patterns—performance cliffs around certain durations, strengths in static retrieval versus weaknesses in motion analysis—and point the way toward targeted improvements in model training. For detailed results and visualizations, check out our Hugging Face Space (linked in the conclusion below).



What did we learn?

Model size isn’t everything. Qwen 2.5-VL at 3B and 7B parameters, as well as InternVL 2.5 at 2B, 4B, and 8B, exhibit nearly indistinguishable long-video curves across model sizes within each family. They all plateau at roughly the same context length, showing that simply scaling parameters doesn’t automatically grant a longer temporal horizon.

Gemini 2.5-Pro is in a league of its own. It’s the only model that maintains strong accuracy on videos longer than one hour.

Trade-offs across tasks matter. Qwen 2.5-VL shines on the Information Synthesis (OCR) task—identifying and ordering dispersed text snippets—yet falls behind on Fine-Grained Temporal Perception, where precise motion counting is required.



Conclusion – Let’s Raise the Bar for Long-Video AI

TimeScope demonstrates that “hour-long video understanding” is still more slogan than reality. By revealing where even state-of-the-art models fall short on temporal reasoning, information synthesis, and motion perception, the benchmark invites us to rethink how we train and evaluate multimodal systems.

  1. Run the Demo – Explore the public Space: https://huggingface.co/spaces/Apollo-LMMs/TimeScope
  2. Benchmark Locally – Evaluate any model with two quick commands:
    pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
    python -m lmms_eval --model-path <your-model> --benchmark timescope
    
  3. Join the Leaderboard – Submit your scores and see how your model compares.

We hope this benchmark helps the community make steady, measurable progress toward models that better understand video over time.

We’re open-sourcing all components of TimeScope.


