Understanding the visual knowledge of language models


You’ve likely heard that a picture is worth a thousand words, but can a large language model (LLM) get the picture if it’s never seen images before?

As it turns out, language models trained purely on text have a solid understanding of the visual world. They can write image-rendering code to generate complex scenes with intriguing objects and compositions, and even when that knowledge isn’t used properly at first, LLMs can refine their images. Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) observed this when prompting language models to self-correct their code for different images, where the systems improved on their simple clipart drawings with each query.

The visual knowledge of these language models is gained from how concepts like shapes and colors are described across the web, whether in language or code. When given a direction like “draw a parrot in the jungle,” users jog the LLM to consider what it’s read in descriptions before. To assess how much visual knowledge LLMs have, the CSAIL team constructed a “vision checkup” for LLMs: using their “Visual Aptitude Dataset,” they tested the models’ abilities to draw, recognize, and self-correct these concepts. Collecting each final draft of these illustrations, the researchers trained a computer vision system that identifies the content of real photos.
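
For a concrete sense of the setup, the checkup’s three tasks map naturally onto simple prompt templates. The wording below is a hypothetical sketch of what such queries could look like, not the paper’s actual prompts:

```python
# Hypothetical prompt templates for the three "vision checkup" tasks:
# drawing, recognizing, and self-correcting visual concepts.
DRAW = (
    "Write a Python matplotlib script that draws: {concept}. "
    "Return only the code."
)
RECOGNIZE = (
    "Here is a program that renders an image:\n{code}\n"
    "In a few words, what does the rendered image depict?"
)
SELF_CORRECT = (
    "Here is your previous drawing code for '{concept}':\n{code}\n"
    "Improve the code so the rendered image looks more like a {concept}. "
    "Return only the updated code."
)
```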

“We essentially train a vision system without directly using any visual data,” says Tamar Rott Shaham, co-lead author of the study and an MIT electrical engineering and computer science (EECS) postdoc at CSAIL. “Our team queried language models to write image-rendering code to generate data for us and then trained the vision system to evaluate natural images. We were inspired by the question of how visual concepts are represented through other mediums, like text. To express their visual knowledge, LLMs can use code as a common ground between text and vision.”

To construct this dataset, the researchers first queried the models to generate code for different shapes, objects, and scenes. Then, they compiled that code to render simple digital illustrations, like a row of bicycles, showing that LLMs understand spatial relations well enough to draw the two-wheelers in a horizontal row. As another example, the model generated a car-shaped cake, combining two random concepts. The language model also produced a glowing light bulb, indicating its ability to create visual effects.
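
As an illustration of what such model-written rendering code can look like, here is a minimal matplotlib sketch in the spirit of “draw a row of bicycles.” This example is written for this article, not output from the study’s models:

```python
# Illustrative sketch: render a row of simplified bicycles with matplotlib.
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

fig, ax = plt.subplots(figsize=(8, 2))
for i in range(4):                                       # four bicycles in a row
    x = i * 3.0                                          # horizontal spacing
    ax.add_patch(Circle((x, 0), 0.5, fill=False))        # rear wheel
    ax.add_patch(Circle((x + 1.5, 0), 0.5, fill=False))  # front wheel
    ax.plot([x, x + 0.75, x + 1.5], [0, 0.9, 0], "k-")   # frame
    ax.plot([x + 0.75, x + 0.75], [0.9, 1.2], "k-")      # seat post
ax.set_xlim(-1, 11.5)
ax.set_ylim(-1, 1.6)
ax.set_aspect("equal")
ax.axis("off")
plt.show()
```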

“Our work shows that when you query an LLM (without multimodal pre-training) to create an image, it knows much more than it seems,” says co-lead author, EECS PhD student, and CSAIL member Pratyusha Sharma. “Let’s say you asked it to draw a chair. The model knows other things about this piece of furniture that it may not have immediately rendered, so users can query the model to improve the visual it produces with each iteration. Surprisingly, the model can iteratively enrich the drawing by improving the rendering code to a significant extent.”
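
In spirit, that self-correction is a plain feedback loop: ask for drawing code, then repeatedly hand the code back with a request to improve it. Below is a minimal sketch of such a loop; `query_llm` and the prompt wording are hypothetical stand-ins, not the study’s actual interface:

```python
# Minimal sketch of iterative self-correction of image-rendering code.
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a text-only LLM chat API."""
    raise NotImplementedError("wire this up to your LLM provider")

def refine_drawing(concept: str, rounds: int = 3) -> str:
    """Ask for drawing code, then request improvements for a few rounds."""
    code = query_llm(
        f"Write a Python matplotlib script that draws: {concept}. "
        "Return only the code."
    )
    for _ in range(rounds):
        code = query_llm(
            f"Here is your current drawing code for '{concept}':\n{code}\n"
            f"Improve it so the rendered image looks more like a {concept}. "
            "Return only the updated code."
        )
    return code
```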

The researchers gathered these illustrations, which were then used to train a computer vision system that can recognize objects within real photos (despite never having seen one before). With this synthetic, text-generated data as its only reference point, the system outperforms vision systems trained on other procedurally generated image datasets that were built from authentic photos.
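
Conceptually, that downstream step is ordinary training with rendered drawings standing in for photos. Below is a minimal supervised PyTorch sketch, assuming the illustrations were saved into class-labeled folders; the `renders/` path, model, and hyperparameters are placeholders, and the paper’s actual training recipe may differ:

```python
# Minimal sketch: train a classifier on LLM-rendered images only.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# "renders/" is a placeholder folder of synthetic drawings, one subfolder per class.
train_set = datasets.ImageFolder("renders/", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(num_classes=len(train_set.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```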

The CSAIL team believes that combining the hidden visual knowledge of LLMs with the artistic capabilities of other AI tools, like diffusion models, could also be useful. Systems like Midjourney sometimes lack the know-how to consistently tweak the finer details in an image, making it difficult for them to handle requests like reducing how many cars are pictured, or placing an object behind another. If an LLM sketched out the requested change for the diffusion model beforehand, the resulting edit could be more satisfactory.

The irony, as Rott Shaham and Sharma acknowledge, is that LLMs sometimes fail to recognize the same concepts that they can draw. This became clear when the models incorrectly identified human re-creations of images within the dataset. Such diverse representations of the visual world likely triggered the language models’ misconceptions.

While the models struggled to perceive these abstract depictions, they demonstrated the creativity to draw the same concepts differently each time. When the researchers queried LLMs to draw concepts like strawberries and arcades multiple times, they produced pictures from diverse angles with varying shapes and colors, hinting that the models might have actual mental imagery of visual concepts (rather than merely reciting examples they saw before).

The CSAIL team believes this procedure could serve as a baseline for evaluating how well a generative AI model can train a computer vision system. Moreover, the researchers aim to expand the range of tasks on which they challenge language models. As for their current study, the MIT group notes that they don’t have access to the training sets of the LLMs they used, making it difficult to further investigate the origin of their visual knowledge. In the future, they intend to explore training an even better vision model by letting the LLM work directly with it.

Sharma and Rott Shaham are joined on the paper by former CSAIL affiliate Stephanie Fu ’22, MNG ’23 and EECS PhD students Manel Baradad, Adrián Rodríguez-Muñoz ’22, and Shivam Duggal, who are all CSAIL affiliates, as well as MIT Associate Professor Phillip Isola and Professor Antonio Torralba. Their work was supported, in part, by a grant from the MIT-IBM Watson AI Lab, a LaCaixa Fellowship, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. They present their paper this week at the IEEE/CVF Computer Vision and Pattern Recognition Conference.
