Imagine a radiologist examining a chest X-ray from a new patient. She notices the patient has swelling in the tissue but does not have an enlarged heart. Looking to speed up diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.
But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: If a patient has tissue swelling and an enlarged heart, the condition is very likely to be cardiac related, but with no enlarged heart there could be several underlying causes.
In a new study, MIT researchers have found that vision-language models are extremely likely to make such a mistake in real-world situations because they don’t understand negation — words like “no” and “doesn’t” that specify what is false or absent.
“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Kumail Alhamoud, an MIT graduate student and lead author of the study.
The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed only as well as a random guess. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.
They show that retraining a vision-language model with this dataset leads to performance improvements when the model is asked to retrieve images that do not contain certain objects. It also boosts accuracy on multiple-choice question answering with negated captions.
But the researchers caution that more work is needed to address the root causes of this problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in high-stakes settings where these models are currently being used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.
“This is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now — without intensive evaluation,” says senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems.
Ghassemi and Alhamoud are joined on the paper by Shaden Alshammari, an MIT graduate student; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at Oxford University; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The research will be presented at the Conference on Computer Vision and Pattern Recognition.
Neglecting negation
Vision-language models (VLMs) are trained using huge collections of images and corresponding captions, which they learn to encode as sets of numbers, called vector representations. The models use these vectors to distinguish between different images.
A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.
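As a rough illustration of this dual-encoder setup (a minimal sketch, not the authors’ code), the snippet below scores an image against a caption with a publicly available CLIP-style model; the Hugging Face library, the checkpoint name, and the image path are assumptions made for the example.

```python
# Minimal sketch of a dual-encoder VLM: separate text and image encoders map a
# caption and an image to vectors, and matching pairs should land close together.
# Assumes the Hugging Face "transformers" library and a public CLIP checkpoint;
# the image path is a hypothetical placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_jumping_fence.jpg")  # hypothetical example image
caption = "a dog jumping over a fence"

with torch.no_grad():
    image_vec = model.get_image_features(
        **processor(images=image, return_tensors="pt")
    )
    text_vec = model.get_text_features(
        **processor(text=[caption], return_tensors="pt", padding=True)
    )

# Cosine similarity between the two vector representations; a higher value means
# the encoders consider the image and the caption a better match.
similarity = torch.nn.functional.cosine_similarity(image_vec, text_vec)
print(similarity.item())
```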
“The captions express what is in the images — they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters,’” Ghassemi says.
Because the image-caption datasets don’t contain examples of negation, VLMs never learn to identify it.
To dig deeper into this problem, the researchers designed two benchmark tasks that test the ability of VLMs to understand negation.
For the first, they used a large language model (LLM) to re-caption images in an existing dataset by asking the LLM to think about related objects not in an image and write them into the caption. Then they tested models by prompting them with negation words to retrieve images that contain certain objects, but not others.
For the second task, they designed multiple-choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that doesn’t appear in the image or negating an object that does appear in the image; a hedged sketch of how one such item could be scored appears below.
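The sketch below shows how a multiple-choice item of this kind might be scored with an off-the-shelf CLIP-style model: each candidate caption is ranked against the image, and the highest-scoring caption is taken as the model’s answer. The checkpoint, image path, and candidate captions are illustrative assumptions, not the paper’s benchmark data.

```python
# Illustrative scoring of one multiple-choice item: pick the candidate caption
# the model rates as most similar to the image. Checkpoint, image, and captions
# are assumptions for illustration, not the authors' benchmark.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # hypothetical image: a car, no bicycle
candidates = [
    "a street with a car and a bicycle",   # adds an object not in the image
    "a street with a car but no bicycle",  # correctly negates the missing object
    "a street with no car",                # negates an object that is present
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image.squeeze(0)

# A model with affirmation bias tends to ignore the "no ..." phrasing and may
# rank the wrong caption highest.
print(candidates[scores.argmax().item()])
```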
The models often failed at both tasks, with image retrieval performance dropping by nearly 25 percent with negated captions. When it came to answering multiple-choice questions, the best models only achieved about 39 percent accuracy, with several models performing at or even below random chance.
One reason for this failure is a shortcut the researchers call affirmation bias — VLMs ignore negation words and focus on the objects in the images instead.
“This doesn’t just happen for words like ‘no’ and ‘not.’ Regardless of how you express negation or exclusion, the models will simply ignore it,” Alhamoud says.
This was consistent across every VLM they tested.
“A solvable problem”
Since VLMs aren’t typically trained on image captions with negation, the researchers developed datasets with negation words as a first step toward solving the problem.
Using a dataset of 10 million image-text caption pairs, they prompted an LLM to propose related captions that specify what is excluded from the images, yielding new captions with negation words.
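As a sketch of what such LLM-based recaptioning might look like (the model name, prompt wording, and example caption here are assumptions, not the authors’ exact setup), one could ask a language model to rewrite each caption so it also states a plausible absent object:

```python
# Hedged sketch of LLM-based recaptioning: ask a language model to rewrite a
# caption so it also mentions a related object that is plausibly absent from
# the image, phrased with natural negation. Requires the "openai" package and
# an API key; the model name and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

original_caption = "a dog jumping over a fence"  # hypothetical source caption

prompt = (
    "Rewrite this image caption so it also mentions one related object that is "
    "plausibly NOT in the image, using natural negation (e.g., 'no', 'without'). "
    f"Caption: {original_caption}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

negated_caption = response.choices[0].message.content
print(negated_caption)
```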
They had to be especially careful that these synthetic captions still read naturally, or they could cause a VLM to fail in the real world when faced with more complex captions written by humans.
They found that finetuning VLMs with their dataset led to performance gains across the board. It improved models’ image retrieval abilities by about 10 percent, while also boosting performance on the multiple-choice question answering task by about 30 percent.
“But our solution is not perfect. We are just recaptioning datasets, a form of data augmentation. We haven’t even touched how these models work, but we hope this is a signal that this is a solvable problem and others can take our solution and improve it,” Alhamoud says.
At the same time, he hopes their work encourages more users to think carefully about the problem they want to use a VLM to solve and to design some examples to test it before deployment.
In the future, the researchers could expand upon this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop additional datasets that include image-caption pairs for specific applications, such as health care.