Teaching AI models the broad strokes to sketch more like humans do


When you’re trying to communicate or understand ideas, words don’t always do the trick. Sometimes the more effective approach is to make a simple sketch of that idea — for instance, diagramming a circuit might help make sense of how the system works.

But what if artificial intelligence could help us explore these visualizations? While these systems are typically proficient at creating realistic paintings and cartoonish drawings, many models fail to capture the essence of sketching: its stroke-by-stroke, iterative process, which helps humans brainstorm and edit how they want to represent their ideas.

A new drawing system from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University can sketch more like we do. Their method, called “SketchAgent,” uses a multimodal language model — AI systems that train on text and images, like Anthropic’s Claude 3.5 Sonnet — to turn natural language prompts into sketches in a few seconds. For instance, it can doodle a house either by itself or through collaboration, drawing with a human or incorporating text-based input to sketch each part separately.

The researchers showed that SketchAgent can create abstract drawings of diverse concepts, like a robot, butterfly, DNA helix, flowchart, and even the Sydney Opera House. In the future, the tool could be expanded into an interactive art game that helps teachers and researchers diagram complex concepts or give users a quick drawing lesson.

CSAIL postdoc Yael Vinker, who is the lead author of a paper introducing SketchAgent, notes that the system introduces a more natural way for humans to communicate with AI.

“Not everyone is aware of how much they draw in their daily life. We may draw our thoughts or workshop ideas with sketches,” she says. “Our tool aims to emulate that process, making multimodal language models more useful in helping us visually express ideas.”

SketchAgent teaches these models to draw stroke-by-stroke without training on any data — instead, the researchers developed a “sketching language” in which a sketch is translated into a numbered sequence of strokes on a grid. The system was given an example of how things like a house would be drawn, with each stroke labeled according to what it represented — such as the seventh stroke being a rectangle labeled as a “front door” — to help the model generalize to new concepts.
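To make the grid-and-stroke idea concrete, here is a minimal sketch of what such a representation could look like, assuming a hypothetical coarse grid, field names, and serialization format; the scheme actually used in the paper may differ.

```python
# A minimal, hypothetical illustration of a "sketching language": each stroke
# is a numbered sequence of points on a coarse grid, paired with a semantic
# label, and the whole sequence can be serialized as text for a language
# model's context window. Grid size and format are assumptions, not the
# authors' exact specification.

from dataclasses import dataclass

GRID_SIZE = 50  # assumed coarse canvas grid, e.g. 50 x 50 cells


@dataclass
class Stroke:
    index: int                     # order in which the stroke is drawn
    label: str                     # what the stroke represents
    points: list[tuple[int, int]]  # grid coordinates traced by the stroke


def to_prompt(strokes: list[Stroke]) -> str:
    """Serialize a labeled stroke sequence into plain text, so it can be
    placed in a multimodal language model's prompt as an in-context example."""
    lines = []
    for s in strokes:
        path = " -> ".join(f"({x},{y})" for x, y in s.points)
        lines.append(f"stroke {s.index} [{s.label}]: {path}")
    return "\n".join(lines)


# Toy example echoing the article: the seventh stroke of a house is a
# rectangle labeled "front door".
house_door = Stroke(
    7, "front door",
    [(22, 40), (22, 30), (28, 30), (28, 40), (22, 40)],
)
print(to_prompt([house_door]))
```

In this framing, the in-context example shows the model both where strokes go on the grid and what each one means, which is what lets it attempt new concepts it was never explicitly trained to draw.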

Vinker wrote the paper alongside three CSAIL affiliates — postdoc Tamar Rott Shaham, undergraduate researcher Alex Zhao, and MIT Professor Antonio Torralba — as well as Stanford University Research Fellow Kristine Zheng and Assistant Professor Judith Ellen Fan. They’ll present their work at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR) this month.

Assessing AI’s sketching abilities

While text-to-image models such as DALL-E 3 can create intriguing drawings, they lack a crucial component of sketching: the spontaneous, creative process where each stroke can impact the overall design. SketchAgent’s drawings, by contrast, are modeled as a sequence of strokes, appearing more natural and fluid, like human sketches.

Prior works have mimicked this process, too, but they trained their models on human-drawn datasets, which are often limited in scale and variety. SketchAgent instead uses pre-trained language models, which are knowledgeable about many concepts but don’t know how to sketch. When the researchers taught language models this process, SketchAgent began to sketch diverse concepts it hadn’t explicitly trained on.

Still, Vinker and her colleagues wanted to see if SketchAgent was actively working with humans on the sketching process, or if it was working independently of its drawing partner. The team tested their system in collaboration mode, where a human and a language model work toward drawing a particular concept in tandem. Removing SketchAgent’s contributions revealed that their tool’s strokes were essential to the final drawing. In a drawing of a sailboat, for instance, removing the artificial strokes representing a mast made the overall sketch unrecognizable.
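As a rough illustration of that ablation (not the authors’ code), the snippet below tags each stroke of a toy collaborative sailboat with its contributor and drops the agent’s strokes before re-rendering; the field names and strokes are hypothetical.

```python
# Hypothetical sketch of the collaboration ablation: keep only the human's
# strokes and see whether the remaining drawing is still recognizable.

sailboat = [
    {"author": "human", "label": "hull", "points": [(10, 40), (40, 40), (35, 45), (15, 45)]},
    {"author": "agent", "label": "mast", "points": [(25, 40), (25, 15)]},
    {"author": "agent", "label": "sail", "points": [(25, 15), (38, 35), (25, 35)]},
]

human_only = [s for s in sailboat if s["author"] == "human"]
print(f"full sketch: {len(sailboat)} strokes; without the agent: {len(human_only)} strokes")
# Re-rendering `human_only` would leave a hull with no mast or sail, so the
# sketch no longer reads as a sailboat, mirroring the finding described above.
```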

In another experiment, CSAIL and Stanford researchers plugged different multimodal language models into SketchAgent to see which could create the most recognizable sketches. Their default backbone model, Claude 3.5 Sonnet, generated the most human-like vector graphics (essentially text-based files that can be converted into high-resolution images). It outperformed models like GPT-4o and Claude 3 Opus.
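Stroke sequences map naturally onto vector graphics. As a rough, hypothetical illustration of why (not the authors’ rendering pipeline), the snippet below converts a list of strokes into an SVG document, a text-based format that can be rasterized at any resolution.

```python
# Convert stroke point lists into a standalone SVG string: each stroke
# becomes a polyline, and the text file can be rendered at high resolution
# by any SVG viewer. Grid size and scale are illustrative assumptions.

def strokes_to_svg(strokes, grid_size=50, scale=10):
    """Turn a list of strokes (each a list of (x, y) grid points) into SVG text."""
    size = grid_size * scale
    polylines = []
    for points in strokes:
        coords = " ".join(f"{x * scale},{y * scale}" for x, y in points)
        polylines.append(
            f'<polyline points="{coords}" fill="none" stroke="black" stroke-width="2"/>'
        )
    body = "\n  ".join(polylines)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">\n'
        f'  {body}\n</svg>'
    )


# Toy sailboat: hull and mast as two strokes.
sailboat = [
    [(10, 40), (40, 40), (35, 45), (15, 45), (10, 40)],  # hull
    [(25, 40), (25, 15)],                                 # mast
]
print(strokes_to_svg(sailboat))
```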

“The fact that Claude 3.5 Sonnet outperformed other models like GPT-4o and Claude 3 Opus suggests that this model processes and generates visual-related information differently,” says co-author Tamar Rott Shaham.

She adds that SketchAgent could become a helpful interface for collaborating with AI models beyond standard, text-based communication. “As models advance in understanding and generating other modalities, like sketches, they open up new ways for users to express ideas and receive responses that feel more intuitive and human-like,” says Rott Shaham. “This could significantly enrich interactions, making AI more accessible and versatile.”

While SketchAgent’s drawing prowess is promising, it can’t make professional sketches yet. It renders simple representations of concepts using stick figures and doodles, but struggles to doodle things like logos, sentences, complex creatures such as unicorns and cows, and specific human figures.

At times, their model also misunderstood users’ intentions in collaborative drawings, like when SketchAgent drew a bunny with two heads. According to Vinker, this may be because the model breaks down each task into smaller steps (also called “chain-of-thought” reasoning). When working with humans, the model creates a drawing plan, potentially misinterpreting which part of that outline a human is contributing to. The researchers could possibly refine these drawing skills by training on synthetic data from diffusion models.

Moreover, SketchAgent often requires a few rounds of prompting to generate human-like doodles. In the future, the team aims to make it easier to interact and sketch with multimodal language models, including by refining their interface.

Still, the tool suggests AI could draw diverse concepts the way humans do, with step-by-step human-AI collaboration that results in more aligned final designs.

This work was supported, in part, by the U.S. National Science Foundation, a Hoffman-Yee Grant from the Stanford Institute for Human-Centered AI, the Hyundai Motor Co., the U.S. Army Research Laboratory, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.
