The ‘Download More Labels!’ Illusion in AI Research


A common view in current machine learning research is that machine learning itself can be used to improve the quality of AI dataset annotations – particularly image captions intended for use in vision-language models (VLMs). This line of thinking is driven by the high cost of human annotation, and by the added burden of supervising annotator performance.

Arguably this is the AI equivalent of the early 2000s ‘download more RAM’ meme, which satirized the notion that a hardware limitation could be resolved with a software-based fix.

It is also an under-regarded issue; while new AI models attract widespread attention in both the public and business spheres, annotation is often treated as a trivial detail in machine learning pipelines, overshadowed by the excitement surrounding broader frameworks.

In fact, the ability of machine learning systems to recognize and reproduce patterns (the central use case of nearly all AI systems) depends on the quality and consistency of real-world annotations – labels and phrases that are created or adjudicated by real people, often making subjective judgments about individual data points in non-ideal circumstances.

Inevitably, systems that seek to observe and reproduce patterns in annotator behavior (and thereby replace human annotators and facilitate accurate labeling at scale) cannot hope to perform well on data that falls outside the examples taken from human observers. Nothing ‘similar’ is quite the same, and cross-domain equivalency remains a problematic pursuit in computer vision.

The ‘upstream data buck’ has to stop somewhere, and in this case, that is exactly where it stops – with a human brain making some kind of subjective distinction in order to codify data for an artificial system.

The RAG Trade

Until recently, the inaccuracies arising from under-curated dataset annotations were, perhaps, seen as acceptable collateral damage in the context of the imperfect but still-marketable results obtained from generative AI systems.

Indeed, only this year a study from Singapore concluded that hallucinations – i.e., the instances when AI systems invent material that undermines our intentions – are inevitable, and bound up with the conceptual architecture of such systems.

To counter this, RAG-based agents – which can ‘verify’ facts through web searches – have become popular in research and in applied business solutions. However, they add to the resource cost and to the latency of queries; moreover, novel information bolted onto a trained model cannot compete with the more intricate and deeply-intertwined connections that characterize the model’s native layers.
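
To make the added cost concrete, below is a minimal sketch of a RAG-style verification step in Python; the `search_web` and `generate` functions are hypothetical stand-ins for a retrieval backend and a language model, not calls from any particular framework. The extra retrieval round-trip is where the additional latency and resource cost enter.

```python
# Minimal sketch of a RAG-style 'fact verification' step (assumptions only).
# search_web() and generate() are hypothetical stand-ins for a retrieval
# backend and a language model; they are not from any specific library.

def search_web(query, top_k=3):
    """Hypothetical retrieval call: returns a list of text snippets."""
    raise NotImplementedError("plug in a real search backend here")

def generate(prompt):
    """Hypothetical language-model call: returns a text completion."""
    raise NotImplementedError("plug in a real model here")

def answer_with_retrieval(question):
    # The extra retrieval round-trip below is the source of the added
    # latency and resource cost described above.
    snippets = search_web(question)
    context = "\n".join(snippets)
    prompt = (
        "Answer the question using only the evidence below.\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```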

It would therefore be better if the annotation data that informs these models were significantly less flawed in the first place, even if it cannot be perfect (not least because this activity encroaches on the realm of human subjectivity).

RePOPE

A new paper from Germany highlights the problems that arise from relying on older, widely used datasets, focusing specifically on the accuracy and reliability of their image captions. The researchers’ findings suggest that label errors in benchmarks can mask or misrepresent hallucination in vision-language models.

Source: https://arxiv.org/pdf/2504.15707

Imagine a model is shown a picture of a street scene and asked whether there is a bicycle in it. The model answers ‘yes’. If the benchmark dataset says there is no bicycle, the model is marked as wrong. But if a bicycle is in the image, and was simply missed during annotation, then the model’s answer was correct, and the benchmark has failed. Errors like this can accumulate across a dataset, giving a distorted picture of which models are accurate and which are prone to hallucination.

Thus, when incorrect or ambiguous annotations are treated as ground truth, models may appear to hallucinate when they are correct, or else seem accurate when they are not, distorting both the measurement of hallucination and the ranking of model performance, and making it harder to diagnose or address the issue with certainty.
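
As a toy illustration of this effect, assuming a POPE-style yes/no scoring scheme (the answers and labels below are invented for the example):

```python
# Toy illustration (invented data): the same model answer is scored
# differently depending on whether the benchmark label is correct.

model_answer = "yes"      # the model says a bicycle is present
benchmark_label = "no"    # the annotator missed the bicycle
true_label = "yes"        # a bicycle really is in the image

def counted_as_hallucination(answer, label):
    # Under a POPE-style protocol, answering 'yes' against a 'no' label
    # is counted as a hallucinated object.
    return answer == "yes" and label == "no"

print(counted_as_hallucination(model_answer, benchmark_label))  # True: penalized
print(counted_as_hallucination(model_answer, true_label))       # False: actually correct
```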

The new paper revisits a widely used benchmark called Polling-based Object Probing Evaluation (POPE), which tests whether vision-language models can correctly say what is or is not in an image.

POPE relies on labels from the influential Microsoft COCO: Common Objects in Context (MSCOCO) dataset, a collection of annotated images that has long been treated as offering a good standard of annotation accuracy.

POPE evaluates object hallucination in large vision-language models by reframing the issue as a binary classification task. Rather than parsing generated captions, the system poses simple questions to the model about whether specific objects are present in an image, using templates such as ‘Is there a <object> in the image?’.

Examples of object hallucination in vision-language models. Bolded labels indicate objects marked as present in the original annotations, while red labels show objects hallucinated by the models. The left example reflects a traditional instruction-based evaluation, while the three examples on the right are drawn from different POPE benchmark variants. Source: https://aclanthology.org/2023.emnlp-main.20.pdf


Ground-truth objects (answer: ‘yes’) are paired with sampled non-existent objects (answer: ‘no’), chosen through random, frequent (‘popular’), or co-occurrence-based (‘adversarial’) strategies. This setup allows for a more stable, prompt-insensitive evaluation of hallucination, without relying on complex rule-based caption analysis.
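
A minimal sketch of how such question pairs might be assembled from image annotations is shown below; the annotation format and the `make_questions` helper are illustrative assumptions rather than POPE's actual code, and only the random and 'popular' sampling strategies are sketched.

```python
import random

# Illustrative sketch of POPE-style question construction; the annotation
# format and this helper are assumptions for clarity, not POPE's own code.

TEMPLATE = "Is there a {obj} in the image?"

def make_questions(annotated_objects, vocabulary, num_negatives=3,
                   strategy="random", object_frequency=None):
    questions = []

    # Positive questions: objects annotated as present (ground-truth answer 'yes').
    for obj in sorted(annotated_objects):
        questions.append((TEMPLATE.format(obj=obj), "yes"))

    # Negative questions: objects assumed absent (ground-truth answer 'no'),
    # sampled at random or from the most frequent dataset objects ('popular').
    # The 'adversarial' variant, not sketched here, instead samples objects
    # that frequently co-occur with those annotated as present.
    candidates = [o for o in vocabulary if o not in annotated_objects]
    if strategy == "popular" and object_frequency is not None:
        candidates.sort(key=lambda o: object_frequency.get(o, 0), reverse=True)
        negatives = candidates[:num_negatives]
    else:
        negatives = random.sample(candidates, k=min(num_negatives, len(candidates)))

    for obj in negatives:
        questions.append((TEMPLATE.format(obj=obj), "no"))

    return questions

# Example with invented annotations for a single image:
pairs = make_questions({"person", "tennis racket"},
                       ["person", "tennis racket", "chair", "dog", "car", "bicycle"])
```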

The authors of the new paper, which introduces RePOPE, challenge the assumed accuracy of POPE by rechecking the labels on the benchmark’s images (i.e., MSCOCO), and find that a surprising number are wrong or unclear.

Examples from the 2014 MSCOCO dataset. Source: https://arxiv.org/pdf/1405.0312


These errors change the way models are ranked, with some that originally performed well falling behind when judged against the corrected labels.

In tests, the authors evaluated a range of open-weight vision-language models on both the original POPE benchmark and their re-labeled version.

According to the paper, the corrected annotations led to notable changes in model rankings, particularly in F1 scores, with several models that performed well under POPE dropping in position under RePOPE.

The authors contend that this shift illustrates the extent to which annotation errors can obscure the actual hallucination behavior of models, and they present RePOPE as a more reliable tool for assessing hallucination vulnerability.

In another example from the new paper, we see how the original POPE captions fail to discern subtle objects, such as a person sitting beside the cabin of a tram in the rightmost photo, or the chair obscured by the tennis player in the second photo from the left.

Method and Tests

The researchers re-labeled all of the annotations in the original MSCOCO dataset, with two human labelers assigned to each data instance. Where ambiguity arose as to the quality of the original labels (as in the examples below), those items were set aside from the testing round.

Ambiguous cases, where labeling inconsistencies in POPE reflect unclear category boundaries. For instance, a teddy bear labeled as a bear, a motorcycle as a bicycle, or airport vehicles as cars. These cases are excluded from RePOPE due to the subjective nature of such classifications, as well as the inconsistencies in MSCOCO's original labels.

The paper quantifies the scale of the problem:

Results of the re-annotation: the positive questions are shared across all three POPE variants. Among those labeled 'Yes' in POPE, 9.3 percent were found to be incorrect and 13.8 percent were classified as ambiguous. For the 'No' questions, 1.7 percent were mislabeled and 4.3 percent were ambiguous.
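
A hedged sketch of how such a two-annotator relabeling pass might be aggregated is given below; the verdict categories and the `merge_verdicts` helper are illustrative assumptions, not the authors' actual tooling.

```python
# Illustrative aggregation of two human re-annotation verdicts per question.
# The category names and logic are assumptions for clarity, not the authors' code.

def merge_verdicts(verdict_a, verdict_b):
    """Each verdict is 'yes', 'no' or 'unsure' for 'is the object present?'."""
    if "unsure" in (verdict_a, verdict_b) or verdict_a != verdict_b:
        # Disagreement or uncertainty: exclude the question as ambiguous,
        # mirroring RePOPE's handling of unclear category boundaries.
        return "ambiguous"
    return verdict_a  # both annotators agree: keep the corrected label

print(merge_verdicts("yes", "yes"))  # 'yes'       -> corrected positive label
print(merge_verdicts("yes", "no"))   # 'ambiguous' -> excluded from RePOPE
```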

The authors evaluated a range of open-weight models on POPE and on RePOPE, across diverse architectures and model sizes. The models chosen included some of the leading architectures on the OpenVLM leaderboard: InternVL2.5 (8B/26B/38B/78B and 8B-MPO/26B-MPO); LLaVA-NeXT (Vicuna, Mistral 7B and Llama variants); LLaVA-OneVision; Ovis2 (1B/2B/4B/8B); PaliGemma-3B; and PaliGemma2 (3B/10B).

Initial results: the high error rate in the original positive labels leads to a sharp drop in true positives across all models. False positives vary across subsets, nearly doubling on the random subset, remaining largely unchanged on the popular subset, and decreasing slightly on the adversarial subset. The relabeling has a major effect on F1-based rankings: models like Ovis2-4B and Ovis2-8B, which performed well on the popular and adversarial splits in POPE, also rise to the top on the random subset under RePOPE. Please refer to the source PDF for better resolution.

The results graphs above illustrate how the number of true positives and false positives changes after correcting the labels in the benchmark.

True positives fell across all models, showing that they were often credited for correct answers when those answers were only correct under faulty labels, while false positives followed a more varied pattern.

On the ‘random’ version of POPE, false positives nearly doubled for many models. Since the relabeling found that a significant share of the objects originally marked as present were in fact absent from the images, answers that had been credited as correct detections were re-scored as hallucinations. In this case, many apparent model successes were in fact artifacts of dataset labeling mistakes.

For the ‘adversarial’ version of POPE, where questions were based on objects that frequently co-occur, false positives decreased. This likely reflects the higher likelihood that the supposedly absent object was in fact present in the image, but had been left unlabeled.

Although these shifts affected precision and recall, model rankings stayed relatively stable for both metrics.

The F1 score – POPE’s main evaluation measure – was much more sensitive to the label corrections. On the random subset, models that ranked near the top under the original labels, such as InternVL2.5-8B and -26B, dropped to the bottom when scored with RePOPE. Others, such as Ovis2-4B and -8B, rose to the top.
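
To illustrate why F1 reacts so strongly, here is a small sketch with invented counts, in which true positives fall and false positives rise for two hypothetical models after label correction (mirroring the qualitative pattern on the random subset), and their F1 ranking flips:

```python
# Invented counts only: they mimic the qualitative pattern reported for the
# random subset (true positives fall, false positives rise) and show how the
# F1 ranking of two hypothetical models can flip after label correction.

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

#                       (tp, fp, fn)
original  = {"Model A": (80, 30, 20), "Model B": (88, 35, 12)}
corrected = {"Model A": (74, 36, 16), "Model B": (70, 53, 10)}

for name in original:
    print(name,
          "original F1:", round(f1(*original[name]), 3),
          "| corrected F1:", round(f1(*corrected[name]), 3))
# Model B leads on the original labels (~0.79 vs ~0.76) but falls behind
# once the labels are corrected (~0.69 vs ~0.74).
```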

A similar pattern emerged in the accuracy scores, though the authors note that these may now be biased, because the corrected dataset contains an uneven number of positive and negative examples.
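
A brief sketch of why accuracy becomes harder to interpret once the positive and negative questions are no longer balanced (the counts are invented for illustration):

```python
# Invented counts: once the corrected benchmark has more negative than
# positive questions, plain accuracy rewards a trivial 'always answer no'
# strategy, which is why the authors caution that accuracy may now be biased.

positives, negatives = 600, 1400          # no longer a 50/50 split
always_no_correct = negatives             # 'no' is right on every negative question
accuracy = always_no_correct / (positives + negatives)
print(f"'always no' accuracy: {accuracy:.2f}")  # 0.70, despite detecting nothing
```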

The authors argue that the strong impact of annotation errors on benchmark results underscores the need for high-quality data. To support more reliable evaluation of object hallucination, they have released the corrected labels on GitHub.

However, they note that this re-labeling does not fully address the benchmark’s saturation, since many models still achieve true positive and true negative rates above 90%. They suggest that additional benchmarks, such as DASH-B, which uses a more difficult set of negative examples, should be used alongside RePOPE.

Conclusion

This particular experiment was possible because of the very small scale of the dataset involved. Proving the same hypothesis on hyperscale datasets would involve working on very limited fragments of the data; in highly diverse large datasets, it could prove near-impossible to isolate statistically representative and semantically coherent groupings – potentially skewing the results.

Even if it were possible, what remedy would there be under the current state of the art? The argument moves back, inevitably, towards the need for better and more copious human annotation.

In this regard, ‘better’ and ‘more copious’ are separate problems in their own right, since one can obtain a greater volume of annotations through race-to-the-bottom economies such as Amazon Mechanical Turk (AMT). Obviously, this potentially exploitative sub-economy frequently produces inferior results.

Alternatively, one could farm out annotation tasks to economic regions where the same expenditure would yield a larger quantity of annotations. However, the further removed the annotator is from the intended use case of the model their labels will shape, the less likely it is that the resulting model will align with the needs or expectations of the target domain.

This therefore remains one of the most persistent and unresolved challenges in the economics of machine learning development.

 
