Recent research from Russia proposes an unconventional method to detect unrealistic AI-generated images – not by improving the accuracy of large vision-language models (LVLMs), but by intentionally leveraging their tendency to hallucinate.
The novel approach extracts multiple ‘atomic facts’ about a picture using LVLMs, then applies natural language inference (NLI) to systematically measure contradictions amongst these statements – effectively turning the model’s flaws into a diagnostic tool for detecting images that defy common sense.
Source: https://arxiv.org/pdf/2503.15948
Asked to evaluate the realism of the second image, the LVLM can see that something is amiss, since the depicted camel has three humps, which is unknown in nature.
However, the LVLM initially conflates the extra hump with an extra camel, since that is the only way you could ever see three humps in a single ‘camel picture’. It then proceeds to hallucinate something even more unlikely than three humps (i.e., ‘two heads’), and never mentions the very thing that appears to have triggered its suspicions – the improbable extra hump.
The researchers of the new work found that LVLMs can perform this kind of evaluation natively, and on a par with (or better than) models that have been fine-tuned for a task of this kind. Since fine-tuning is complicated, expensive and rather brittle in terms of downstream applicability, the discovery of a native use for one of the biggest roadblocks in the current AI revolution is a refreshing twist on the general trends in the literature.
Open Assessment
The importance of the approach, the authors assert, is that it can be deployed with open-source frameworks. While a sophisticated and high-investment model such as ChatGPT can (the paper concedes) potentially offer better results in this task, the arguable real value of the literature for most of us (and particularly for the hobbyist and VFX communities) is the possibility of incorporating and developing new breakthroughs in local implementations; conversely, everything destined for a proprietary commercial API system is subject to withdrawal, arbitrary price rises, and censorship policies that are more likely to reflect a company’s corporate concerns than the user’s needs and responsibilities.
The new paper comes from five researchers across the Skolkovo Institute of Science and Technology (Skoltech), the Moscow Institute of Physics and Technology, and the Russian companies MTS AI and AIRI. The work has an accompanying GitHub page.
Method
The authors use the Israeli/US WHOOPS! dataset for the project:

Source: https://whoops-benchmark.github.io/
The dataset comprises 500 synthetic images and 10,874 annotations, specifically designed to test AI models’ commonsense reasoning and compositional understanding. It was created in collaboration with designers tasked with generating challenging images via text-to-image systems such as Midjourney and the DALL-E series – producing scenarios that are difficult or impossible to capture naturally:

Source: https://huggingface.co/datasets/nlphuji/whoops
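For readers who want to examine the material directly, the dataset can be pulled from the Hugging Face URL above; the short sketch below is an assumption about typical usage rather than the authors' own loading code (the dataset may be gated, requiring its terms to be accepted and an access token):

```python
# Sketch: loading WHOOPS! from the Hugging Face hub for local inspection.
# Access may require accepting the dataset's terms and logging in with a token.
from datasets import load_dataset

whoops = load_dataset("nlphuji/whoops")
print(whoops)                 # show the available splits and columns
first_split = next(iter(whoops.values()))
print(first_split[0].keys())  # fields of a single example
```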
The new approach works in three stages: first, the LVLM (specifically LLaVA-v1.6-mistral-7b) is prompted to generate multiple simple statements – called ‘atomic facts’ – describing an image. These statements are generated using Diverse Beam Search, ensuring variability in the outputs.

Source: https://arxiv.org/pdf/1610.02424
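As a rough illustration of this first stage (not the authors’ code – the checkpoint name, prompt wording and generation settings below are assumptions), Hugging Face Transformers exposes Diverse Beam Search directly through its generation parameters:

```python
# Sketch: draw several diverse 'atomic fact' candidates from LLaVA-v1.6-Mistral-7B.
# Checkpoint name, prompt and generation settings are assumptions for illustration.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed community checkpoint

processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("camel.jpg")  # hypothetical test image
prompt = "[INST] <image>\nState one simple fact about this image. [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Diverse Beam Search: beams are split into groups, and a diversity penalty
# pushes each group toward a different continuation, yielding varied statements.
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    num_beams=10,
    num_beam_groups=5,
    num_return_sequences=10,
    diversity_penalty=1.0,
    do_sample=False,
)
facts = [
    processor.decode(seq, skip_special_tokens=True).split("[/INST]")[-1].strip()
    for seq in outputs
]
print(facts)
```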
Next, each generated statement is systematically compared with every other statement using a Natural Language Inference model, which assigns scores reflecting whether pairs of statements entail, contradict, or are neutral toward one another.
Contradictions indicate hallucinations or unrealistic elements within the image:

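A minimal sketch of this pairwise scoring stage might look as follows; the roberta-large-mnli checkpoint and the example facts are illustrative assumptions rather than the paper’s exact configuration:

```python
# Sketch: score every ordered pair of atomic facts for entailment, neutrality
# and contradiction. The NLI checkpoint and the example facts are assumptions.
from itertools import permutations
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_ID = "roberta-large-mnli"  # labels: 0=contradiction, 1=neutral, 2=entailment
tokenizer = AutoTokenizer.from_pretrained(NLI_ID)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_ID).eval()

facts = [
    "A camel with three humps stands in the desert.",  # hypothetical atomic facts
    "Two camels are standing side by side.",
    "The animal in the picture has two heads.",
]

pair_scores = {}
for premise, hypothesis in permutations(facts, 2):
    enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**enc).logits.softmax(dim=-1).squeeze(0)
    pair_scores[(premise, hypothesis)] = {
        "contradiction": probs[0].item(),
        "neutral": probs[1].item(),
        "entailment": probs[2].item(),
    }

# Many high contradiction probabilities across the pairs suggest that the
# image contains elements that defy common sense.
for pair, scores in pair_scores.items():
    print(pair, scores)
```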
Finally, the method aggregates these pairwise NLI scores into a single ‘reality score’ that quantifies the overall coherence of the generated statements.
The researchers explored different aggregation methods, with a clustering-based approach performing best. The authors applied the k-means clustering algorithm to separate individual NLI scores into two clusters, and the centroid of the lower-valued cluster was then chosen as the final metric.
Using two clusters directly aligns with the binary nature of the classification task, i.e., distinguishing realistic from unrealistic images. The logic is similar to simply picking the lowest score overall; however, clustering allows the metric to represent the average contradiction across multiple facts, rather than relying on a single outlier.
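A simplified sketch of this aggregation step, assuming a one-dimensional array of per-pair scores in which higher values mean greater coherence (not the authors’ exact implementation), could look like this:

```python
# Sketch of the aggregation step: two-cluster k-means over per-pair scores,
# returning the centroid of the lower-valued cluster as the 'reality score'.
import numpy as np
from sklearn.cluster import KMeans

def reality_score(per_pair_scores: np.ndarray) -> float:
    """per_pair_scores: 1-D array with one score per statement pair."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    km.fit(per_pair_scores.reshape(-1, 1))
    # The lower centroid reflects the typical level of contradiction across
    # several pairs, rather than a single outlier pair.
    return float(km.cluster_centers_.min())

# Example: mostly agreeing pairs plus a few strong contradictions
scores = np.array([0.90, 0.85, 0.80, 0.75, -0.70, -0.65])
print(reality_score(scores))  # low value -> image likely defies common sense
```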
Data and Tests
The researchers tested their system on the WHOOPS! baseline benchmark, using rotating test splits (i.e., cross-validation). The models tested were BLIP2 FlanT5-XL and BLIP2 FlanT5-XXL fine-tuned on the splits, and BLIP2 FlanT5-XXL in a zero-shot format (i.e., without additional training).
For an instruction-following baseline, the authors prompted the LVLMs with a phrase that prior research has found effective for spotting unrealistic images.
The models evaluated were LLaVA 1.6 Mistral 7B, LLaVA 1.6 Vicuna 13B, and two sizes (7/13 billion parameters) of InstructBLIP.
The testing procedure was centered on 102 pairs of realistic and unrealistic (‘weird’) images. Each pair consisted of one normal image and one commonsense-defying counterpart.
Three human annotators labeled the images, reaching 92% agreement and indicating a strong consensus on what constituted ‘weirdness’. The accuracy of the assessment methods was measured by their ability to correctly distinguish between realistic and unrealistic images.
The system was evaluated using three-fold cross-validation, randomly shuffling the data with a fixed seed. The authors adjusted the weights for entailment scores (statements that logically agree) and contradiction scores (statements that logically conflict) during training, while ‘neutral’ scores were fixed at zero. The final accuracy was computed as the average across all test splits.
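A simplified sketch of the per-pair weighting described above follows; the weight values shown are illustrative assumptions rather than the paper’s tuned results:

```python
# Sketch: each statement pair contributes a weighted mix of its entailment and
# contradiction probabilities, with the neutral class fixed at zero weight.
# The specific weights below are assumed for illustration only.
import numpy as np

def weighted_pair_scores(nli_probs: np.ndarray, w_entail: float, w_contra: float) -> np.ndarray:
    """nli_probs: array of shape (n_pairs, 3) holding
    [P(contradiction), P(neutral), P(entailment)] for each statement pair."""
    p_contra, p_entail = nli_probs[:, 0], nli_probs[:, 2]
    return w_entail * p_entail + w_contra * p_contra  # neutral weighted at zero

probs = np.array([
    [0.05, 0.15, 0.80],  # a pair of facts that agree
    [0.90, 0.07, 0.03],  # a pair of facts that contradict
])
# The tuned weights reportedly favored contradiction; these particular numbers
# are assumed for the sake of the example.
print(weighted_pair_scores(probs, w_entail=0.2, w_contra=-1.0))
```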

Regarding the initial results shown above, the authors found that the optimal weights consistently favored contradiction over entailment, indicating that contradictions were more informative for distinguishing unrealistic images. Their method outperformed all of the other zero-shot methods tested, closely approaching the performance of the fine-tuned BLIP2 model:

They also noted, somewhat unexpectedly, that InstructBLIP performed better than comparable LLaVA models given the same prompt. While recognizing GPT-4o’s superior accuracy, the paper emphasizes the authors’ preference for demonstrating practical, open-source solutions, and, it seems, can reasonably claim novelty in explicitly exploiting hallucinations as a diagnostic tool.
Conclusion
Nevertheless, the authors acknowledge their project’s debt to the 2024 FaithScore outing, a collaboration between the University of Texas at Dallas and Johns Hopkins University.

Source: https://arxiv.org/pdf/2311.01477
FaithScore measures the faithfulness of LVLM-generated descriptions by verifying consistency against image content, while the new paper’s method explicitly exploits LVLM hallucinations to detect unrealistic images, through contradictions in generated facts evaluated with Natural Language Inference.
The new work is, naturally, dependent on the eccentricities of current language models, and on their disposition to hallucinate. If model development should ever bring forth an entirely non-hallucinating model, even the general principles of the new work would no longer be applicable. However, this remains a difficult prospect.