Teaching AI to Give Better Video Critiques


While Large Vision-Language Models (LVLMs) can be useful aids in interpreting some of the more arcane or difficult submissions in the computer vision literature, there's one area where they're hamstrung: determining the merits and subjective quality of any videos that accompany new papers*.

This is a critical aspect of a submission, since scientific papers often aim to generate excitement through compelling text or visuals – or both.

But in the case of projects that involve video synthesis, authors must show actual video output or risk having their work dismissed; and it's in these demonstrations that the gap between bold claims and real-world performance most often becomes apparent.

I Read the Book, Didn’t See the Movie

Currently, many of the popular API-based Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) will not engage in directly analyzing video content, qualitatively or otherwise. Instead, they will only analyze related transcripts – and, perhaps, comment threads and other strictly text-based adjunct material.

However, an LLM may hide or deny its inability to actually watch videos, unless you call it out on it:

Having been asked to provide a subjective evaluation of a new research paper's associated videos, and having faked a real opinion, ChatGPT-4o eventually confesses that it cannot really view video directly.

Though models such as ChatGPT-4o are multimodal, and can at least analyze photos (such as an extracted frame from a video, see image above), there are some issues even with this: firstly, there is scant basis to give credence to an LLM's qualitative opinion, not least because LLMs are prone to 'people-pleasing' rather than honest discourse.

Secondly, many, if not most, of a generated video's issues are likely to have a temporal aspect that is entirely lost in a frame grab – and so the examination of individual frames serves no purpose.

Finally, the LLM can only give a supposed 'value judgement' based (once again) on having absorbed text-based knowledge, for instance in regard to deepfake imagery or art history. In such a case, trained domain knowledge allows the LLM to correlate analyzed visual qualities of an image with learned embeddings based on human insight:

The FakeVLM project offers targeted deepfake detection via a specialized multi-modal vision-language model. Source: https://arxiv.org/pdf/2503.14905


This is not to say that an LLM cannot obtain information directly from a video; for instance, with the use of adjunct AI systems such as YOLO, an LLM could identify objects in a video – or could do this directly, if trained for an above-average range of multimodal functionalities.

However, the only way that an LLM could possibly evaluate a video subjectively (i.e., qualitatively) is through applying a loss function-based metric that is either known to reflect human opinion well, or else is directly informed by human opinion.

Loss functions are mathematical tools used during training to measure how far a model's predictions are from the correct answers. They provide feedback that guides the model's learning: the greater the error, the higher the loss. As training progresses, the model adjusts its parameters to reduce this loss, gradually improving its ability to make accurate predictions.

Loss functions are used both to guide the training of models, and also to calibrate algorithms that are designed to evaluate the output of AI models (such as the evaluation of simulated photorealistic content from a generative video model).
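As a generic illustration (not drawn from the paper under discussion), a minimal mean-squared-error loss in Python shows the basic pattern: the further a model's predictions drift from the targets, the larger the value returned, and it is this value that a training loop works to push down:

```python
import numpy as np

def mse_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Mean squared error: the average squared gap between predictions and targets."""
    return float(np.mean((predictions - targets) ** 2))

# A model whose predictions are close to the targets incurs a small loss...
print(mse_loss(np.array([0.9, 2.1, 2.95]), np.array([1.0, 2.0, 3.0])))  # ~0.0075
# ...while a model that is far off incurs a much larger one.
print(mse_loss(np.array([3.0, 0.0, 7.0]), np.array([1.0, 2.0, 3.0])))   # 8.0
```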

Conditional Vision

One of the most popular metrics/loss functions is Fréchet Inception Distance (FID), which evaluates the quality of generated images by measuring the similarity between their distribution (here meaning, roughly, the statistical spread of their visual features) and that of real images.

Specifically, FID calculates the statistical difference, using means and covariances, between features extracted from both sets of images using the (often criticized) Inception v3 classification network. A lower FID score indicates that the generated images are more similar to real images, implying higher visual quality and diversity.
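A minimal sketch of that calculation, assuming feature arrays have already been extracted by Inception v3 (or any comparable network), and using standard NumPy/SciPy routines rather than any reference FID implementation, might look like this:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # ||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 * (sigma_r @ sigma_g)^(1/2))
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean))
```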

However, FID is essentially comparative, and arguably self-referential in nature. To remedy this, the later Conditional Fréchet Distance (CFD, 2021) approach differs from FID by comparing generated images to real images, and evaluating a score based on how well both sets match an additional condition, such as an (inevitably subjective) class label or input image.

In this way, CFD accounts for how accurately images meet the intended conditions, not just their overall realism or diversity among themselves.

Examples from the 2021 CFD outing. Source: https://github.com/Michael-Soloveitchik/CFID/


CFD follows a recent trend towards baking qualitative human interpretation into loss functions and metric algorithms. Though such a human-centered approach guarantees that the resulting algorithm will not be 'soulless' or merely mechanical, it presents at the same time a number of issues: the possibility of bias; the burden of updating the algorithm in line with new practices, and the fact that this will remove the possibility of consistent comparative standards over a period of years across projects; and budgetary limitations (fewer human contributors will make the determinations more specious, while a higher number could prevent useful updates due to cost).

cFreD

This brings us to a new paper from the US that offers Conditional Fréchet Distance (cFreD), a novel take on CFD that is designed to better reflect human preferences by evaluating both visual quality and text-image alignment.

Partial results from the new paper: image rankings (1–9) by different metrics for the prompt "A living room with a couch and a laptop computer resting on the couch." Green highlights the top human-rated model (FLUX.1-dev), purple the lowest (SDv1.5). Only cFreD matches human rankings. Please refer to the source paper for complete results, which we do not have room to reproduce here. Source: https://arxiv.org/pdf/2503.21721


The authors argue that existing evaluation methods for text-to-image synthesis, such as Inception Score (IS) and FID, poorly align with human judgment because they measure only image quality without considering how images match their prompts:

The paper's tests indicate that the authors' proposed metric, cFreD, consistently achieves higher correlation with human preferences than FID, FDDINOv2, CLIPScore, and CMMD on three benchmark datasets (PartiPrompts, HPDv2, and COCO).

Concept and Method

The authors note that the current gold standard for evaluating text-to-image models involves gathering human preference data through crowd-sourced comparisons, similar to methods used for large language models (such as the LMSys Arena).

For instance, the PartiPrompts Arena uses 1,600 English prompts, presenting participants with pairs of images from different models and asking them to pick their preferred image.

Similarly, the Text-to-Image Arena Leaderboard employs user comparisons of model outputs to generate rankings via ELO scores. However, collecting this kind of human evaluation data is expensive and slow, leading some platforms – such as the PartiPrompts Arena – to stop updates altogether.
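For illustration, a basic Elo update for a single pairwise preference vote (a generic sketch, not the leaderboard's actual code; the K-factor of 32 is an arbitrary choice here) works as follows:

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Update two models' Elo ratings after one head-to-head preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: an upset win by the lower-rated model shifts both ratings noticeably.
print(elo_update(1500, 1600, a_wins=True))  # (~1520.5, ~1579.5)
```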

The Artificial Analysis Image Arena Leaderboard, which ranks the currently-estimated leaders in generative visual AI. Source: https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard


Although alternative methods trained on historical human preference data exist, their effectiveness for evaluating future models remains uncertain, because human preferences constantly evolve. Consequently, automated metrics such as FID, CLIPScore, and the authors' proposed cFreD seem likely to remain essential evaluation tools.

The authors assume that both real and generated images conditioned on a prompt follow Gaussian distributions, each defined by conditional means and covariances. cFreD measures the expected Fréchet distance across prompts between these two conditional distributions. This can be formulated either directly in terms of conditional statistics or by combining unconditional statistics with cross-covariances involving the prompt.

By incorporating the prompt in this way, cFreD is able to assess both the realism of the images and their consistency with the given text.
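Written out explicitly (our own paraphrase of the description above, not a formula reproduced from the paper), with c denoting the prompt and the subscripts r and g the real and generated conditional Gaussians, the quantity being estimated is the prompt-averaged Fréchet distance:

```latex
\[
\mathrm{cFreD}
  = \mathbb{E}_{c}\!\left[
      \lVert \mu_r(c) - \mu_g(c) \rVert_2^{2}
      + \operatorname{Tr}\!\left( \Sigma_r(c) + \Sigma_g(c)
        - 2\left( \Sigma_r(c)\,\Sigma_g(c) \right)^{1/2} \right)
    \right]
\]
```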

Data and Tests

To evaluate how well cFreD correlates with human preferences, the authors used image rankings from multiple models prompted with the same text. Their evaluation drew on two sources: the Human Preference Dataset v2 (HPDv2) test set, which includes nine generated images and one COCO ground truth image per prompt; and the aforementioned PartiPrompts Arena, which contains outputs from four models across 1,600 prompts.

The authors collected the scattered Arena data points into a single dataset; in cases where the real image did not rank highest in human evaluations, they used the top-rated image as the reference.

To test newer models, they sampled 1,000 prompts from COCO's train and validation sets, ensuring no overlap with HPDv2, and generated images using nine models from the Arena Leaderboard. The original COCO images served as references in this part of the evaluation.

The cFreD approach was evaluated against four statistical metrics: FID; FDDINOv2; CLIPScore; and CMMD. It was also evaluated against four learned metrics trained on human preference data: Aesthetic Score; ImageReward; HPSv2; and MPS.

The authors evaluated correlation with human judgment from both a ranking and a scoring perspective: for each metric, model scores were reported and rankings calculated, and these were assessed for their alignment with human evaluation results, with cFreD using DINOv2-G/14 for image embeddings and the OpenCLIP ConvNext-B text encoder for text embeddings†.

Previous work on learning human preferences measured performance using per-item rank accuracy, which computes ranking accuracy for each image-text pair before averaging the results.

The authors instead evaluated cFreD using a global rank accuracy, which assesses overall ranking performance across the full dataset; for statistical metrics, they derived rankings directly from raw scores; and for metrics trained on human preferences, they first averaged the rankings assigned to each model across all samples, then determined the final ranking from these averages.
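A rough sketch of how such a global rank accuracy could be computed (our reconstruction from the description above, not the paper's code, and using hypothetical scores) counts the concordant model pairs between the metric's ordering and the human ordering:

```python
from itertools import combinations

def rank_accuracy(metric_scores: dict, human_ranking: list) -> float:
    """Fraction of model pairs that the metric orders the same way as humans do.

    metric_scores: model name -> metric score (lower is better, as with FID/cFreD).
    human_ranking: model names ordered from most to least preferred by humans.
    """
    concordant, total = 0, 0
    # combinations() yields pairs (a, b) with `a` ranked above `b` by humans
    for a, b in combinations(human_ranking, 2):
        total += 1
        if metric_scores[a] < metric_scores[b]:  # metric also prefers `a`
            concordant += 1
    return concordant / total

# Hypothetical lower-is-better scores for three of the tested models:
scores = {"FLUX.1-dev": 12.1, "SDXL": 18.4, "SDv1.5": 25.9}
print(rank_accuracy(scores, ["FLUX.1-dev", "SDXL", "SDv1.5"]))  # 1.0 – fully concordant
```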

Initial tests used ten frameworks: GLIDE; COCO; FuseDream; DALLE 2; VQGAN+CLIP; CogView2; Stable Diffusion V1.4; VQ-Diffusion; Stable Diffusion V2.0; and LAFITE.

Model rankings and scores on the HPDv2 test set using statistical metrics (FID, FDDINOv2, CLIPScore, CMMD, and cFreD) and human preference-trained metrics (Aesthetic Score, ImageReward, HPSv2, and MPS). Best results are shown in bold, second best are underlined.

Of the initial results, the authors comment:

Among all evaluated metrics, cFreD achieved the highest rank accuracy (91.1%), demonstrating – the authors contend – strong alignment with human judgments.

HPSv2 followed with 88.9%, while FID and FDDINOv2 produced competitive scores of 86.7%. Although metrics trained on human preference data generally aligned well with human evaluations, cFreD proved to be the most robust and reliable overall.

Below we see the results of the second testing round, this time on the PartiPrompts Arena, using SDXL; Kandinsky 2; Würstchen; and Karlo V1.0.

Model rankings and scores on PartiPrompt using statistical metrics (FID, FDDINOv2, CLIPScore, CMMD, and cFreD) and human preference-trained metrics (Aesthetic Score, ImageReward, and MPS). Best results are in bold, second best are underlined.

Here the paper states:

Finally, the authors conducted an evaluation on the COCO dataset using nine modern text-to-image models: FLUX.1[dev]; Playground v2.5; Janus Pro; and the Stable Diffusion variants SDv3.5-L Turbo, 3.5-L, 3-M, SDXL, 2.1, and 1.5.

Human preference rankings were sourced from the Text-to-Image Leaderboard, and given as ELO scores:

Model rankings on randomly sampled COCO prompts using automatic metrics (FID, FDDINOv2, CLIPScore, CMMD, and cFreD) and human preference-trained metrics (Aesthetic Score, ImageReward, HPSv2, and MPS). A rank accuracy below 0.5 indicates more discordant than concordant pairs, and best results are in bold, second best are underlined.

Regarding this round, the researchers state:

The authors also tested Inception V3 as a backbone, drawing attention to its ubiquity in the literature, and found that it performed reasonably, but was outmatched by transformer-based backbones such as DINOv2-L/14 and ViT-L/16, which more consistently aligned with human rankings – and they contend that this supports replacing Inception V3 in modern evaluation setups.

Win rates showing how often each image backbone's rankings matched the true human-derived rankings on the COCO dataset.

Conclusion

It's clear that while human-in-the-loop solutions are the optimal approach to the development of metric and loss functions, the scale and frequency of updates necessary to such schemes will continue to make them impractical – perhaps until such time as widespread public participation in evaluations is generally incentivized; or, as has been the case with CAPTCHAs, enforced.

The credibility of the authors' new system still depends on its alignment with human judgment, albeit at one more remove than many recent human-in-the-loop approaches; and cFreD's legitimacy therefore remains rooted in human preference data (obviously, since without such a benchmark, the claim that cFreD reflects human-like evaluation would be unprovable).

Arguably, enshrining our current criteria for 'realism' in generative output into a metric function could well be a mistake in the long term, since our definition of that concept is currently under assault from the new wave of generative AI systems, and is set for frequent and significant revision.

 

*

†

 
