GPT-4V Has Directional Dyslexia


Our study based on the WSDM 2023 Toloka VQA Challenge

Image generated by GPT-4V

A year has passed since the Toloka Visual Question Answering (VQA) Challenge at the WSDM Cup 2023, and as we predicted back then, the winning machine-learning solution didn't match up to the human baseline. Nevertheless, this past year has been full of breakthroughs in Generative AI. It seems like every other article flips between pointing out what OpenAI's GPT models can't do and praising what they do better than we do.

Since autumn 2023, GPT-4 Turbo has gained "vision" capabilities, meaning it accepts images as input and can now directly take part in VQA challenges. We were curious to test its ability against the human baseline in our Toloka challenge, wondering whether that gap has finally closed.

Visual Question Answering

Visual Question Answering (VQA) is a multi-disciplinary artificial intelligence research problem focused on making AI interpret images and answer related questions in natural language. This area has various applications: aiding visually impaired individuals, enriching educational content, supporting image search capabilities, and providing video search functionalities.

The development of VQA "comes with great responsibility", such as ensuring the reliability and safety of the technology's application. With AI systems gaining vision capabilities, the potential for misinformation increases, considering claims that images paired with false information can make statements appear more credible.

One of the subfields of the VQA domain, VQA Grounding, is not only about answering visual questions but also about connecting those answers to elements within the image. This subfield has great potential for applications like Mixed Reality (XR) headsets, educational tools, and online shopping, improving the user interaction experience by directing attention to specific parts of an image. The goal of the Toloka VQA Challenge was to support the development of VQA Grounding.

Toloka’s VQA Challenge recap

In the Toloka VQA Challenge, the task was to identify a single object and put it in a bounding box, based on a question that describes the object's functions rather than its visual characteristics. For instance, instead of asking to find something round and red, a typical question would be "What object in the image is nice in a salad and on a pizza?" This reflects the ability of humans to perceive objects in terms of their utility. It's like being asked to find "a thing to swat a fly with" when you see a table with a newspaper, a coffee mug, and a pair of glasses — you'd know what to pick with no visual description of the object.

Question: What can we use to cut the pizza into slices?

Image from Toloka VQA Challenge (CC BY 4.0)

The challenge required integrating visual, textual, and common-sense knowledge at the same time. As a baseline approach, we proposed combining YOLOR and CLIP as separate visual and textual backbone models. However, the winning solution didn't use a two-tower paradigm at all, choosing instead the Uni-Perceiver model with a ViT-Adapter for better localization. It achieved a high final Intersection over Union (IoU) score of 76.347; however, it didn't reach the crowdsourcing baseline of an IoU of 87.
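For reference, IoU measures the overlap between a predicted and a ground truth bounding box. Below is a minimal sketch of the standard computation for axis-aligned boxes, assuming (left, top, right, bottom) pixel coordinates; it is illustrative and not the competition's official scoring code.

def iou(box_a, box_b):
    # Boxes are (left, top, right, bottom) in pixel coordinates.
    inter_left = max(box_a[0], box_b[0])
    inter_top = max(box_a[1], box_b[1])
    inter_right = min(box_a[2], box_b[2])
    inter_bottom = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes don't intersect.
    inter_area = max(0, inter_right - inter_left) * max(0, inter_bottom - inter_top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area
    return inter_area / union_area if union_area > 0 else 0.0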

Considering this vast gap between human and AI solutions, we were very curious to see how GPT-4V would perform in the Toloka VQA Challenge. Since the challenge was based on the MS COCO dataset, used countless times in Computer Vision (for instance, in the Visual Spatial Reasoning dataset) and therefore likely "known" to GPT-4 from its training data, there was a chance that GPT-4V might come closer to the human baseline.

GPT-4V and Toloka VQA Challenge

Initially, we wanted to find out whether GPT-4V could handle the Toloka VQA Challenge as is.

However, though GPT-4V mostly identified the object correctly, it had serious trouble providing meaningful coordinates for bounding boxes. This wasn't entirely unexpected, since OpenAI's guide acknowledges GPT-4V's limitations in tasks that require identifying the precise spatial localization of an object in an image.
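As a rough illustration of the setup, here is a minimal sketch of how an image-question pair can be sent to a vision-capable GPT-4 model through the OpenAI Python SDK. The model name, prompt wording, and image URL are placeholder assumptions, not the exact configuration we used.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder prompt and image URL; the exact wording in our experiments differed.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What can we use to cut the pizza into slices? "
                     "Name the object and give its bounding box as "
                     "[left, top, right, bottom] pixel coordinates."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/pizza.jpg"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)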

Image by author

This led us to explore how well GPT-4V handles the identification of basic high-level locations in an image. Can it figure out where things are — not exactly, but whether they're on the left, in the middle, or on the right? Or at the top, in the middle, or at the bottom? Since these aren't precise locations, it might be doable for GPT-4V, especially as it's been trained on millions of images paired with captions mentioning objects' directional locations. Educational materials often describe pictures in detail (just think of textbooks on brain structure that mention parts like "dendrites" at the "top left" or "axons" at the "bottom right" of an image).

Understanding the spatial reasoning limitations of LLMs and MLMs, even for simple reasoning like the kind discussed above, is crucial in practical applications. The integration of GPT-4V into the "Be My Eyes" application, which assists visually impaired users by interpreting images, perfectly illustrates this importance. Despite GPT-4V's abilities, the application advises caution, highlighting the technology's current inability to fully substitute for human judgment in critical safety and health contexts. However, the exact topics where the technology performs poorly are not identified explicitly.

GPT-4V and spatial reasoning

For our exploration into GPT-4V's reasoning about the basic locations of objects in images, we randomly selected 500 image-question pairs from a larger set of 4,500 pairs, the competition's private test dataset. We tried to minimize the chances of our test data leaking into GPT-4V's training data, since this subset of the competition data was released the latest in the competition timeline.

Out of those 500 pairs, 25 were rejected by GPT-4V and flagged as 'invalid image'. We suspect this rejection was due to built-in safety measures, likely triggered by the presence of objects that could be classified as Personally Identifiable Information (PII), such as people's faces. The remaining 475 pairs were used as the basis for our experiments.

Understanding how things are positioned in relation to one another, like determining what's left, middle, or right and top, middle, or bottom, isn't as straightforward as it might sound. A lot depends on the observer's viewpoint, whether the object has a front, and, if so, how it is oriented. So, spatial reasoning in humans may depend on significant inductive bias about the world as a result of our evolutionary history.

Question: What protects the eyes from lamp glare?

Image from Toloka VQA Challenge (CC BY 4.0)

Take the example pair with a lampshade above, sampled from the experiment data. One person might say it's towards the top-left of the image because the lampshade leans a bit left, while another might call it middle-top, seeing it centered in the image. Both views have a point. It's tough to make strict rules for identifying locations because objects can have all kinds of shapes and parts, like a lamp's long cord, which can change how we see where it's placed.

Keeping this complexity in mind, we planned to try at least two different methods for labeling the ground truth of where things are in an image.

For our first approach, we chose simple automated heuristics to figure out where objects are placed in an image, both horizontally and vertically. This idea came from an assumption that GPT-4V might use algorithms found in publicly available code for tasks of a similar nature.

It works in the following way: if the difference in pixels between the center of the image and the center of the object (marked by its bounding box) is less than or equal to a certain percentage of the image's width (for horizontal position) or height (for vertical position), then we label the object as being in the middle. If the difference is larger, it gets labeled as either left or right (or top or bottom). We settled on using 2% as the threshold percentage. This decision was based on observing how this difference behaved for objects of various sizes relative to the overall size of the image.

def horizontal_position(bb_left, bb_right, image_width, threshold=0.02):
    # Compare the object's center (from its bounding box) with the image's center.
    object_horizontal_center = bb_left + (bb_right - bb_left) / 2
    image_horizontal_center = image_width / 2
    difference = object_horizontal_center - image_horizontal_center
    # Differences larger than 2% of the image width count as 'right' or 'left';
    # anything within the threshold counts as 'middle'.
    if difference > image_width * threshold:
        return 'right'
    elif difference < -image_width * threshold:
        return 'left'
    else:
        return 'middle'
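The vertical label can be derived the same way from the bounding box's top and bottom edges and the image height. Here is a sketch mirroring the horizontal heuristic above; the helper name is ours, and it assumes image coordinates that grow downwards.

def vertical_position(bb_top, bb_bottom, image_height, threshold=0.02):
    # Compare the object's vertical center with the image's vertical center.
    object_vertical_center = bb_top + (bb_bottom - bb_top) / 2
    image_vertical_center = image_height / 2
    difference = object_vertical_center - image_vertical_center
    if difference > image_height * threshold:
        return 'bottom'  # y grows downwards in image coordinates
    elif difference < -image_height * threshold:
        return 'top'
    else:
        return 'middle'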

For the second approach, we used crowdsourced labeling. Here are the details of how the crowdsourcing project was set up:

  • Images were shown to the crowd without bounding boxes to encourage less biased (towards the ground truth answer) labeling of an object's location, as one would respond to a question about the object's placement in a visual context.
  • GPT-4V's answers were displayed as both a hint and a way to validate its object detection accuracy.
  • Participants had the option to report if a question couldn't be clearly answered with the given image, removing any potentially ambiguous or grey-zone cases from the dataset.

To ensure the quality of the crowdsourced responses, I reviewed all instances where GPT-4V's answers didn't match the crowd's. I couldn't see either GPT-4V's or the crowd's responses during this review process, which allowed me to adjust the labels without preferential bias.

Image by author. Labeling interface in Toloka

GPT-4V has directional dyslexia

We opted for accuracy as our evaluation metric since the classes in our dataset were evenly distributed. After evaluating GPT-4V's performance against the ground truth — established through crowdsourcing and heuristic methods — on 475 images, we excluded 45 pairs that the crowd found difficult to answer. The remaining data revealed that GPT-4V's accuracy in identifying both horizontal and vertical positions was remarkably low, at around 30%, when compared to both the crowdsourced and heuristic labels.

Accuracy of GPT-4V’s answers in comparison with automated heuristics
Accuracy of GPT-4V’s answers in comparison with crowd labeling

Even when we accepted GPT-4V's answer as correct if it matched either the crowdsourced or the heuristic label, its accuracy still didn't reach 50%, coming in at 40.2%.
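For clarity, this lenient accuracy counts an answer as correct if it agrees with either label source. Below is a minimal sketch with hypothetical variable names, assuming three parallel lists of position labels:

def lenient_accuracy(gpt_labels, crowd_labels, heuristic_labels):
    # Count a GPT-4V label as correct if it matches the crowd label
    # or the heuristic label for the same image-question pair.
    correct = sum(
        gpt == crowd or gpt == heuristic
        for gpt, crowd, heuristic in zip(gpt_labels, crowd_labels, heuristic_labels)
    )
    return correct / len(gpt_labels)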

To further validate these findings, we manually reviewed 100 image-question pairs that GPT-4V had incorrectly labeled.

By directly asking GPT-4V to specify the objects’ locations and comparing its responses, we confirmed the initial results.

Image by author. Labeling interface in Toloka

GPT-4V consistently confused left and right, top and bottom, so if GPT-4V is your navigator, be prepared to take the scenic route — unintentionally.

However, GPT-4V's object recognition capabilities are impressive, achieving an accuracy rate of 88.84%. This suggests that by integrating GPT-4V with specialized object detection tools, we could potentially match (or even exceed) the human baseline. This is the next objective of our research.

Prompt engineering & directional dyslexia

To make sure we weren't pointing out GPT-4V's limitations without any prompt optimization efforts (so as not to become what we hate), we explored various prompt engineering techniques described in the research literature as enhancing spatial reasoning in LLMs.

Question: What is used as the symbol or emblem of a country?

Image from Toloka VQA Challenge (CC BY 4.0)

We applied three prompt engineering techniques from the literature to the experimental dataset example above, which GPT-4V stubbornly and consistently misinterpreted. The flag asked about is positioned in the middle-right of the image.

The "Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic" paper introduces a method combining Chain of Thought (CoT) with position annotations, specifically center annotations, called Grounding CoT (GCoT). In the GCoT setting, the authors prompt the model to provide CoT along with center points for each mentioned object. Since the authors specifically trained their model to provide coordinates of objects in an image, we had to adapt the prompt engineering technique to a less strict setting, asking the model to reason about the object's location based on the object's center.
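To give a sense of the adaptation, here is a sketch of how such a prompt could be phrased; the wording below is illustrative and not the verbatim prompt from our experiments.

gcot_prompt = (
    "Think step by step: first describe the object that answers the question "
    "and reason about where its center point lies in the image. "
    "Then state its horizontal position (left, middle, or right) "
    "and its vertical position (top, middle, or bottom).\n"
    "Question: What is used as the symbol or emblem of a country?"
)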

Image by author. Grounding CoT approach (correct answer is middle-right)

The study "Mapping Language Models to Grounded Conceptual Spaces" by Patel & Pavlick (2022) illustrates that GPT-3 can grasp spatial and cardinal directions even within a text-based grid by 'orienting' the models with specific word forms learned during training. They substitute traditional directional terms, using north/south and west/east instead of top/bottom and left/right, to guide the model's spatial reasoning.
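In our adaptation, this amounts to swapping the directional vocabulary in the prompt and translating the answer back to image terms. A small illustrative sketch follows; the mapping reflects the paper's idea, while the variable names and question wording are our own assumptions.

# Rephrase directional terms as cardinal directions in the prompt,
# then translate the model's answer back to image terms.
to_cardinal = {"left": "west", "right": "east", "top": "north", "bottom": "south"}
from_cardinal = {value: key for key, value in to_cardinal.items()}

question = ("Is the flag in the west, the middle, or the east of the image? "
            "And in the north, the middle, or the south?")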

Image by author. Cardinal directions approach (correct answer is east-south)

Lastly, the "Visual Spatial Reasoning" article stresses the importance of different perspectives in spatial descriptions: the intrinsic frame centered on an object (e.g. behind the chair = the side with a backrest), the relative frame from the viewer's perspective, and the absolute frame using fixed coordinates (e.g. "north" of the chair). English typically favors the relative frame, so we explicitly mentioned it in the prompt, hoping to refine GPT-4V's spatial reasoning.

Image by author. Relative frame approach (correct answer is middle-right)

As we can see from the examples, GPT-4V's challenges with basic spatial reasoning persist.

Conclusions and future work

GPT-4V struggles with simple spatial reasoning, like identifying an object's high-level horizontal and vertical position in an image. Yet its strong object recognition skills, based just on implicit functional descriptions, are promising. Our next step is to combine GPT-4V with models specifically trained for object detection in images. Let's see if this combination can beat the human baseline in the Toloka VQA Challenge!
