Models have become quite good at understanding text on its own, but what about text in images, which provides important contextual information? For instance, navigating a map, or understanding a meme? The ability to reason about the interactions between the text and visual context in images can power many real-world applications, such as AI assistants, or tools to help the visually impaired.
We refer to these tasks as “context-sensitive text-rich visual reasoning tasks”.
At the moment, most evaluations of instruction-tuned large multimodal models (LMMs) focus on testing how well models can respond to human instructions posed as questions or imperative sentences (“Count this”, “List that”, etc.) over images… but not how well they understand context-sensitive text-rich scenes!
That’s why we (researchers from the University of California, Los Angeles) created ConTextual, a Context-sensitive Text-rich visuaL reasoning dataset for evaluating LMMs. We also released a leaderboard, so that the community can see for themselves which models are the best at this task.
For a deeper dive, you can also check out these additional resources: paper, code, dataset, validation dataset, and leaderboard.
What is ConTextual?
ConTextual is a Context-sensitive Text-rich visuaL reasoning dataset consisting of 506 challenging instructions for LMM evaluation. We created a diverse set of instructions on text-rich images, with the constraint that they should require context-sensitive joint reasoning over the textual and visual cues in the image.
It covers 8 real-world visual scenarios – Time Reading, Shopping, Navigation, Abstract Scenes, Mobile Application, Webpages, Infographics, and Miscellaneous Natural Scenes. (See the figure for a sample from each scenario.)
Each sample consists of:
- A text-rich image
- A human-written instruction (question or imperative task)
- A human-written reference response
The dataset is released in two forms:
- (a) a validation set of 100 instances from the full dataset, with instructions, images, and reference answers to the instructions.
- (b) a test dataset with instructions and images only.
The leaderboard contains model results on both the validation and test datasets (this information is also available in the paper). The validation set allows practitioners to test and iterate on their approaches easily. The evaluation sandbox is available in our GitHub.
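If you want to explore the data directly, both splits can be loaded with the 🤗 datasets library. Below is a minimal sketch; the dataset IDs and field names are assumptions, so please verify them against the dataset and validation dataset links above:

```python
# Minimal sketch for loading the two splits with the datasets library.
# The dataset IDs below are assumptions; check them against the dataset links above.
from datasets import load_dataset

val = load_dataset("ucla-contextual/contextual_val")    # 100 instances, with reference responses
test = load_dataset("ucla-contextual/contextual_test")  # 506 instances, instructions and images only

print(val)   # inspect the available splits and column names
print(test)
```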
Experiments
In our initial experiments, we assessed the performance of 13 models with our benchmark. We divided them into three categories:
- Augmented LLM approach: GPT-4 + visual information in the form of OCR of the image and/or dense image captions;
- Closed-Source LMMs: GPT4V(ision) and Gemini-Vision-Pro;
- Open-Source LMMs: LLaVA-v1.5-13B, ShareGPT4V-7B, Instruct-Blip-Vicuna-7B, mPlugOwl-v2-7B, Bliva-Vicuna-7B, Qwen-VL-7B and Idefics-9B.
Our dataset includes a reference response for each instruction, allowing us to test various automatic evaluation methods. For evaluation, we use an LLM-as-a-judge approach, and prompt GPT-4 with the instruction, reference response, and predicted response. The model has to return whether the predicted response is acceptable or not. (GPT-4 was chosen because it correlated the most with human judgment in our experiments.)
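As an illustration, a bare-bones version of such a judge could look like the sketch below. The prompt wording and the `judge` helper are illustrative only, not the exact prompt from our pipeline; the real evaluation code is in our GitHub.

```python
# Illustrative sketch of an LLM-as-a-judge call, assuming an OpenAI API key is set
# in the environment. The prompt text is a paraphrase, not our exact evaluation prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are given an instruction about an image, a reference response,
and a model's predicted response. Decide whether the prediction is acceptable.
Answer with a single word: "yes" or "no".

Instruction: {instruction}
Reference response: {reference}
Predicted response: {prediction}"""


def judge(instruction: str, reference: str, prediction: str) -> int:
    """Return 1 if the GPT-4 judge rates the prediction as acceptable, else 0."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                instruction=instruction, reference=reference, prediction=prediction
            ),
        }],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return int(verdict.startswith("yes"))
```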
Let’s take a look at some examples!
Example 1
In this example, GPT-4V provides an incorrect response to the instruction, despite its seemingly logical reasoning. Green indicates responses that match the reference, while red highlights errors in the responses. In addition, a Summarized Reasoning is provided to outline the rationale used by GPT-4V to arrive at its answer.
Example 2
In this example, GPT-4V correctly responds to the instruction. However, ShareGPT-4V-7B (the best performing open-source LMM) and GPT-4 w/ Layout-aware OCR + Caption (Augmented LLM) produce an incorrect response, due to a lack of joint reasoning over text and image.
You can find more examples like this in the Appendix section of our paper!
Key Takeaways!
While working on this, we found that:
- Modern LMMs (proprietary and open models) struggle to perform on the ConTextual dataset while humans are good at it, hinting at the potential for model improvements to enhance reasoning over text-rich images, a domain with significant real-world applications.
- Proprietary LMMs perform poorly in infographics reasoning that involves time reading, indicating a gap in their capabilities compared to humans. Notably, GPT-4V, the best performing model, surpasses humans in abstract reasoning, potentially due to exposure to meme and quote data, but struggles in time-related tasks where humans excel.
- For open-source models such as LLaVA-1.5-13B and ShareGPT-4V-7B, there is a striking gap between the domains on which they achieve acceptable human ratings (abstract and natural scene contexts) and the other domains (time reading, infographics, navigation, shopping, web, and mobile usage). It is therefore likely that many of the domains we cover in our samples are out-of-distribution for these models. Open-source models should therefore aim to increase the diversity of their training data.
- Augmenting a Large Language Model with visual information converted into text via OCR or captions performs notably badly, with a human approval rate of 17.2%. Our samples require a combination of precise visual perception and fine-grained, nuanced vision-language alignment to be solved.
Our evaluation suggests promising next steps include:
- developing enhanced image encoders,
- creating highly accurate image descriptions,
- facilitating fine-grained vision-language alignment to enhance the model’s perception and mitigate the occurrence of hallucinations.
This, in turn, will lead to more effective context-sensitive text-rich visual reasoning.
What’s next?
We’d love to evaluate your models too, to help collectively advance the state of vision-language models! To submit, please follow our guidelines below.
We hope that this benchmark will help in developing nuanced vision-language alignment techniques, and we welcome any kind of collaboration! You can contact us here: Rohan and Hritik, and learn more about the team here: Rohan, Hritik, Kai-Wei Chang, Nanyun (Violet) Peng.
How to Submit?
We are accepting submissions for both the test and validation sets. Please follow the corresponding procedure below.
Validation Set Submission
To submit your validation results to the leaderboard, you can run our auto-evaluation code (Evaluation Pipeline with GPT-4), following these instructions.
We expect submissions to be in JSON format, as shown below:
{"model_name": {"img_url": "The boolean rating of your model on the image, 1 for success and 0 for failure"}}
- Replace model_name with your model name (string)
- Replace img_url with the img_url of the instance (string)
- The value for an img_url is either 0 or 1 (int)
There should be 100 predictions, corresponding to the 100 urls of the val set.
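For example, the following sketch builds a valid submission file from a dictionary of per-image ratings; the model name, URLs, and `ratings` values are placeholders:

```python
import json

model_name = "my-model-7b"  # placeholder: replace with your model name

# Placeholder mapping: each of the 100 validation img_urls -> the 0/1 rating
# produced by the auto-evaluation pipeline.
ratings = {
    "https://example.com/contextual/val_001.png": 1,
    "https://example.com/contextual/val_002.png": 0,
    # ... one entry per validation image, 100 in total
}

with open("val_submission.json", "w") as f:
    json.dump({model_name: ratings}, f, indent=2)
print(f"wrote {len(ratings)} ratings for {model_name}")
```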
To make a submission, please go to the leaderboard hosted on HuggingFace and fill out the Submission form.
Test Set Submission
Once you are happy with your validation results, you can send your model predictions to Rohan and Hritik.
Please include in your email:
- A name for your model.
- Organization (affiliation).
- (Optionally) GitHub repo or paper link.
We expect submissions to be in JSON format, similar to the val set, as shown below:
{"model_name": {"img_url": "predicted response"}}
- Replace model_name with your model name (string)
- Replace img_url with the img_url of the instance (string)
- The value for an img_url is the predicted response for that instance (string)
There should be 506 predictions, corresponding to the 506 urls of the test set.
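The test submission file can be written in the same way as the validation sketch above; only the values change from 0/1 ratings to predicted response strings (the model name, URLs, and responses below are placeholders):

```python
import json

model_name = "my-model-7b"  # placeholder: replace with your model name

# Placeholder mapping: each of the 506 test img_urls -> your model's predicted response.
predictions = {
    "https://example.com/contextual/test_001.png": "The store closes at 9 pm on Saturdays.",
    "https://example.com/contextual/test_002.png": "Take the second exit at the roundabout.",
    # ... one entry per test image, 506 in total
}

with open("test_submission.json", "w") as f:
    json.dump({model_name: predictions}, f, indent=2)
```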

