The ability of machine learning systems to recognize the events that occur inside a video is crucial to the future of AI-based video generation – not least because video datasets require accurate captions in order to produce models that adhere to a user’s request, and that don’t excessively hallucinate.
Source: https://sites.google.com/view/vidrecap
Manually captioning the volume of videos needed for effective training datasets is an unconscionable prospect. Though it is feasible to train AI systems to auto-caption videos, a great many human-generated examples are still needed as ground truth, for variety and coverage.
More importantly, almost every current AI-based video-captioning model samples video at just 1fps, which is not a dense enough capture rate to discern variations in a great many scenarios: sudden micro-expression changes for emotion-recognition systems; rapid events in high-speed sports such as basketball; violent movements; rapid cuts in dramatic movies, where systems such as PySceneDetect may fail to identify them (or are not being used); and many other scenarios where the window of attention clearly needs to be more intense.
Source: https://www.youtube.com/watch?v=_1PuqKno_Ok
Move Fast and Break Logic
This low rate is the standard for various logistical reasons. For one, video-captioning is a resource-intensive activity, whether the system is studying one sequential frame at a time, or else using various methods to semantically cohere a string of frames into an interpretable caption sequence. In either case, the context window is inevitably limited by hardware constraints.
Another reason for 1fps being the current standard is that videos are not generally full of rapid events; it is therefore redundant to give 300 frames of a static snooker table the same attention as the split-second in which a potted black ball wins the championship (see example above).
It is possible to use broader secondary cues to identify pivotal moments in a sports video, such as the sustained crowd reaction to a rapid slam-dunk in a basketball game. However, such clues may occur for other reasons (such as unexpected player injuries), and can’t be relied on. This is one example of how a mislabeled video dataset can lead to a generative video model that hallucinates or misinterprets instructions, i.e., because the model might show a player injury when it was asked to generate a slam-dunk (since the ‘secondary clue’ of crowd agitation was not exclusive to one specific type of event).
This is in some ways a ‘budgetary’ problem, and in other ways a procedural problem. Frameworks to date have operated on the principle that sparse keyframes can effectively capture essential information, but this is more effective in establishing genre and other facets of a video’s subject matter, since evidence, in that case, persists over multiple frames.
F-16
A new paper from China is offering a solution, in the form of the first multimodal large language model (MLLM, or simply LLM) that can analyze video at 16fps instead of the standard 1fps, while avoiding the major pitfalls of increasing the analysis rate.
In tests, the authors claim that the new system, titled F-16, outperforms proprietary state-of-the-art models such as GPT-4o and Google’s Gemini-1.5 Pro. While other current models were able to match or exceed F-16’s results in trials, the competing models were far larger and unwieldier.
Though F-16 was trained on some serious hardware (as we’ll examine shortly), inference is usually far less demanding than training. Therefore we can hope that the code (promised for a near-future release) will be capable of running on medium or high-level domestic GPUs.
What’s needed for the vitality of the hobbyist scene (and that includes the professional VFX scene, much of the time) is a video-captioning model of this kind that can operate, perhaps quantized, on consumer systems, so that the entire generative video scene doesn’t migrate to API-based commercial systems, or force consumers to hook local frameworks up to commercial online GPU services.
Beyond Scaling Up
The authors observe that this kind of approach is a practical alternative to scaling up datasets. One can also infer that if you were going to throw more data at the problem, this is still the kind of approach that could be preferable, since the new system distinguishes events in a more granular way.
They state:
The new paper is titled Improving LLM Video Understanding with 16 Frames Per Second, and comes from eight authors across Tsinghua University and ByteDance.
Method
Since consecutive frames often contain redundant information, F-16 applies a high-frame-rate aligner to compress and encode key motion details while retaining visual semantics. Each frame is first processed by a pretrained image encoder, extracting feature representations before being passed to an aligner based on Gaussian Error Linear Units (GELUs).

Source: https://arxiv.org/pdf/2503.13956
To handle the increased frame count efficiently, F-16 groups frames into small processing windows, merging visual features using a three-layer Multi-Layer Perceptron (MLP), helping to retain only the most relevant motion details, and reducing unnecessary duplication, while preserving the temporal flow of actions. A spatial max-pooling layer further compresses the token count, keeping computational costs within bounds.
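The paper’s exact layer sizes and window length are not reproduced in this article, so the following is only a minimal PyTorch sketch of such a windowed aligner; the window size, feature widths, and pooling factor here are illustrative assumptions rather than the authors’ values:

```python
import torch
import torch.nn as nn

class HighFrameRateAligner(nn.Module):
    """Illustrative sketch: group per-frame features into short windows,
    fuse each window with a GELU MLP, then spatially max-pool the result."""
    def __init__(self, feat_dim=1152, llm_dim=3584, window=4, pool=2):
        super().__init__()
        self.window = window
        # Three-layer MLP with GELU activations fuses a window of frames
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * window, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.pool = nn.MaxPool2d(pool)  # spatial max-pooling cuts the token count

    def forward(self, frame_feats):
        # frame_feats: (T, H, W, C) patch features from the image encoder
        T, H, W, C = frame_feats.shape
        assert T % self.window == 0, "pad or repeat frames to fill a window"
        # Group consecutive frames into windows and concatenate per patch
        x = frame_feats.view(T // self.window, self.window, H, W, C)
        x = x.permute(0, 2, 3, 1, 4).reshape(T // self.window, H, W, self.window * C)
        x = self.mlp(x)                          # fuse motion within each window
        x = self.pool(x.permute(0, 3, 1, 2))     # (N, llm_dim, H/pool, W/pool)
        return x.flatten(2).transpose(1, 2)      # video tokens for the LLM

feats = torch.randn(16, 27, 27, 1152)            # 1 second at 16 FPS, SigLIP-style 27x27 patch grid
tokens = HighFrameRateAligner()(feats)
print(tokens.shape)                              # torch.Size([4, 169, 3584])
```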
The processed video tokens are then fed into the Qwen2-7B LLM, which generates textual responses based on the extracted visual features and a given user prompt.
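The article does not specify exactly how the video tokens and the prompt are combined; the sketch below simply follows the usual image-LLM pattern of prepending visual tokens to the embedded prompt, with shapes carried over from the (assumed) aligner sketch above:

```python
import torch

def build_llm_inputs(video_tokens: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
    """Illustrative: flatten the aligner's video tokens into one sequence and
    prepend them to the embedded user prompt before the LLM's forward pass."""
    visual = video_tokens.flatten(0, 1)          # (n_windows * n_patches, llm_dim)
    return torch.cat([visual, prompt_embeds], dim=0)

video_tokens = torch.randn(4, 169, 3584)         # output shape from the aligner sketch above
prompt_embeds = torch.randn(12, 3584)            # an embedded user prompt (hypothetical length)
print(build_llm_inputs(video_tokens, prompt_embeds).shape)   # torch.Size([688, 3584])
```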
By structuring video input in this way, F-16 enables, the authors assert, more precise event recognition in dynamic scenes, while still maintaining efficiency.
The Short Version
F-16 extends a pretrained image LLM, LLaVA-OneVision, to process video by transforming its visual input pipeline. While standard image LLMs handle isolated frames, F-16’s high-frame-rate aligner reformats multiple frames into a form the model can process more efficiently; this avoids overwhelming the system with redundant information while preserving key motion cues essential for accurate video understanding.
To ensure compatibility with its image-based foundation, F-16 reuses pretrained parameters by restructuring its aligner into three-layer MLPs. This approach allows it to integrate knowledge from single-frame models while adapting to sequential video input.
The aligner first compresses frame sequences into a format optimized for the LLM, preserving the most informative features while discarding unnecessary details. The architectural design enables the system to process high-frame-rate video while keeping computational demands under control, which the authors posit as evidence that scaling is not the only (or the best) way forward for video captioning.
Varying the Pace
Since processing video at 16 FPS improves motion understanding but increases computational cost, particularly during inference, F-16 introduces a variable frame-rate method, allowing it to adjust the frame rate dynamically without retraining.

This flexibility enables the model to operate efficiently at lower FPS when high precision isn’t required, and reduces computational overhead.
At test time, when a lower frame rate is chosen, F-16 reuses previously trained aligner parameters by repeating input frames to match the expected dimensions. This ensures the model can still process video effectively without modifying its architecture.
Unlike naive downsampling (i.e., simply removing frames), which risks losing critical motion details, this method preserves the aligner’s learned motion representations, maintaining accuracy even at reduced frame rates. For general video comprehension, a lower FPS setting can speed up inference without significant performance loss, while high-speed motion analysis can still leverage the full 16 FPS capability.
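A minimal sketch of this test-time trick, under the assumption that the aligner always expects its native 16 FPS layout, could look like the following (the function name and shapes are hypothetical):

```python
import torch

def repeat_to_native_rate(frames: torch.Tensor, target_fps: int, native_fps: int = 16) -> torch.Tensor:
    """Illustrative: when sampling below the native 16 FPS, repeat each sampled
    frame so the aligner still sees its expected input layout, rather than
    naively dropping frames or retraining the aligner."""
    assert native_fps % target_fps == 0, "sketch assumes an integer FPS ratio"
    repeat = native_fps // target_fps
    # frames: (T, H, W, C) features sampled at target_fps
    return frames.repeat_interleave(repeat, dim=0)   # back to the 16 FPS layout

low_fps_feats = torch.randn(4, 27, 27, 1152)          # 1 second sampled at 4 FPS
aligned_input = repeat_to_native_rate(low_fps_feats, target_fps=4)
print(aligned_input.shape)                            # torch.Size([16, 27, 27, 1152])
```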
Data and Tests
Built on Qwen2-7B, F-16 extends LLaVA-OneVision using SigLIP as an image encoder. With video frames sampled at 16 FPS, up to 1,760 frames can be obtained from each video. For longer video clips, frames were uniformly (i.e., more sparsely) sampled.
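As a rough illustration of that sampling policy (the 1,760-frame cap comes from the paper, but the helper below is a hypothetical reconstruction, not the authors’ code):

```python
def sample_frame_indices(total_frames: int, video_fps: float,
                         target_fps: int = 16, max_frames: int = 1760) -> list[int]:
    """Sample at 16 FPS, falling back to uniform (sparser) sampling
    once a clip would exceed the 1,760-frame budget."""
    duration = total_frames / video_fps
    n_wanted = min(int(duration * target_fps), max_frames)
    if n_wanted <= 0:
        return [0]
    # Uniformly spread n_wanted indices over the clip; for short clips this
    # is equivalent to plain 16 FPS sampling, for long clips it is sparser.
    step = total_frames / n_wanted
    return [min(int(i * step), total_frames - 1) for i in range(n_wanted)]

# A 3-minute clip at 30 FPS: 180 s * 16 FPS = 2,880 > 1,760, so sampling thins out
idx = sample_frame_indices(total_frames=5400, video_fps=30.0)
print(len(idx), idx[:5])   # 1760 [0, 3, 6, 9, 12]
```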
For training, F-16 used the same general video datasets as LLaVA-Video, including LLaVA-Video-178K, NExT-QA, ActivityNet-QA, and PerceptionTest.
F-16 was additionally fine-tuned on the high-speed sports datasets FineGym, Diving48, and SoccerNet. The authors also curated a set of 276 NBA games played between November 13 and November 25, 2024, focusing on whether a shot was successful (a task requiring high-frame-rate processing).
The model was evaluated using the NSVA test set, with performance measured by F1 score.
Gymnastics and diving models were evaluated based on event recognition accuracy, while soccer and basketball models tracked passes and shot outcomes.
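For reference, the F1 score on a binary shot-made / shot-missed task is simply the harmonic mean of precision and recall; a minimal sketch, with invented predictions and labels:

```python
def f1_score(preds: list[int], labels: list[int]) -> float:
    """Binary F1 for shot-outcome predictions (1 = shot was successful)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))   # 0.666...
```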
The model was trained for one epoch using 128 NVIDIA H100 GPUs (and at a standard-issue 80GB of VRAM per GPU, this entailed the use of 10.24 terabytes of GPU memory; even by recent standards, this is the highest-specced GPU cluster I have personally come across in keeping with computer vision research literature). A learning rate of 2×10⁻⁵ was used during training.
Additionally, a LoRA fine-tune on sports data used LoRA adapters across 64 GPUs for five epochs. Here, only the LLM was trained, leaving the image encoder frozen.
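The paper’s LoRA hyperparameters are not given in this article, but a hedged sketch of this kind of setup using the Hugging Face peft library (the rank, alpha, and target modules below are assumptions, not the authors’ values) might look like this:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed hyperparameters, for illustration only
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")
llm = get_peft_model(llm, lora_cfg)    # only the LoRA weights remain trainable

# The image encoder (not shown here) stays frozen, mirroring the description above
llm.print_trainable_parameters()
```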
Opposing frameworks tested in the initial round for ‘general video understanding’ were GPT-4o; Gemini-1.5-Pro; Qwen2-VL-7B; VideoLLaMA2-7B; VideoChat2-HD-7B; LLaVA-OV-7B; MiniCPM-V2.6-8B; LLaVA-Video-7B; and NVILA-7B.
The models were evaluated on Video-MME; VideoVista; TemporalBench; MotionBench; Next-QA; MLVU; and LongVideoBench.

Of these results, the authors state:
F-16’s high-frame-rate processing, the authors continue, also resulted in a 13.5% improvement on TemporalBench and a 2.5% gain on MotionBench, compared to existing 7B models, and performed at a similar level to commercial models such as GPT-4o and Gemini-1.5-Pro.
High Speed Sports Video Understanding
F-16 was tested on FineGym, Diving48, SoccerNet, and NBA datasets to evaluate its ability to understand high-speed sports actions.
Using the 10,000 manually annotated NBA clips, the training focused on ball movement and player actions, and on whether the models could accurately determine if a shot was successful, using the NSVA test set evaluated with F1 score.

On FineGym, which measures gymnastics motion recognition, F-16 performed 13.8% better than the previous 7B SOTA model, demonstrating improved fine-grained motion understanding.
Diving48 required identifying complex movement sequences across the takeoff, somersault, twist, and entry phases of each dive, and F-16 showed higher accuracy in recognizing these transitions.
For SoccerNet, the model analyzed 10-second clips, identifying ball passes, and results showed an improvement over existing 7B models, indicating that higher FPS contributes to tracking small and rapid movements.
In the NBA dataset, F-16’s ability to determine shot outcomes approached the accuracy of larger proprietary models such as GPT-4o and Gemini-1.5-Pro, further suggesting that higher frame rates enhance its ability to process dynamic motion.
Variable Frame-Rates
F-16 was tested at different frame rates to measure its adaptability. Instead of retraining, it handled lower FPS by repeating frames to match the aligner’s input structure. This approach retained more performance than simply removing frames, which is liable to cause accuracy loss.
The outcomes indicate that while reducing FPS had some impact on motion recognition, F-16 still outperformed low-frame-rate models and maintained strong results even below 16 FPS.

F-16’s high-frame-rate processing increased computational requirements, although its aligner helped manage these costs by compressing redundant visual tokens.
The model required more FLOPs per video than lower-FPS models, but also achieved higher accuracy per token, suggesting that its frame selection and token compression strategies helped offset the added computation.
Conclusion
It’s difficult to overstate either the importance or the challenges of this particular strand of research – especially this year, which is set to be the breakthrough year for generative video, throwing the shortcomings of video dataset curation and captioning quality into sharp relief.
It should also be emphasized that the challenges involved in getting accurate descriptions of internal video details can’t be solved exclusively by throwing VRAM, time, or disk space at the problem. The method by which events are isolated/extracted from otherwise long and tedious tracts of video (as with golf or snooker video clips, for instance) will benefit from a rethink of the semantic approaches and mechanisms currently dominating SOTA solutions – because some of these limitations were established in more resource-impoverished times.