MIT researchers have developed a generative artificial intelligence-driven approach for planning long-term visual tasks, like robot navigation, that’s about twice as effective as some existing techniques.
Their method uses a specialized vision-language model to perceive the scenario in an image and simulate the actions needed to reach a goal. Then a second model translates those simulations into a standard planning language, and refines the solution.
Ultimately, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two-step system generated plans with a median success rate of about 70 percent, outperforming one of the best baseline methods, which could only reach about 30 percent.
Importantly, the system can solve new problems it hasn't encountered before, making it well-suited for real-world environments where conditions can change at a moment's notice.
“Our framework combines the benefits of vision-language models, like their ability to understand images, with the strong planning capabilities of a formal solver,” says Yilun Hao, an aeronautics and astronautics (AeroAstro) graduate student at MIT and lead author of an open-access paper on this system. “It can take a single image and move through simulation to a reliable, long-horizon plan that could be useful in many real-life applications.”
She is joined on the paper by Yongchao Chen, a graduate student in the MIT Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, an associate professor in AeroAstro and a principal investigator in LIDS; and Yang Zhang, a research scientist at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations.
Tackling visual tasks
For the past few years, Fan and her colleagues have studied using generative AI models to perform complex reasoning and planning, often employing large language models (LLMs) to process text inputs.
Many real-world planning problems, like robotic assembly and autonomous driving, have visual inputs that an LLM can't handle well by itself. The researchers sought to expand into the visual domain by using vision-language models (VLMs), powerful AI systems that can process images and text.
But VLMs struggle to understand spatial relationships between objects in a scene and often fail to reason correctly over many steps. This makes it difficult to use VLMs for long-horizon planning.
On the other hand, scientists have developed robust, formal planners that can generate effective long-horizon plans for complex situations. However, these software systems can't process visual inputs and require expert knowledge to encode a problem into a language the solver can understand.
Fan and her team built an automated planning system that takes the best of both methods. The system, called VLM-guided formal planning (VLMFP), uses two specialized VLMs that work together to turn visual planning problems into ready-to-use files for formal planning software.
The researchers first carefully trained a small model they call SimVLM to specialize in describing the scenario in an image using natural language and simulating a sequence of actions in that scenario. Then a much larger model, which they call GenVLM, uses the description from SimVLM to generate a set of initial files in a formal planning language known as the Planning Domain Definition Language (PDDL).
The files are ready to be fed into a classical PDDL solver, which computes a step-by-step plan to solve the task. GenVLM compares the results of the solver with those of the simulator and iteratively refines the PDDL files.
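The generate-simulate-refine loop can be sketched as follows. This is a minimal illustration, not code from the paper: the function names are hypothetical, and the two VLMs and the PDDL solver are stubbed with fixed outputs so the loop structure is visible.

```python
# Sketch of VLMFP's loop: GenVLM proposes PDDL files, a classical
# solver plans, and SimVLM checks the plan in simulation. All three
# components below are hypothetical stubs standing in for real models.

def sim_vlm_describe(image):
    # SimVLM: describe the scene in natural language (stubbed).
    return "block B is stacked on block A"

def sim_vlm_simulate(image, plan):
    # SimVLM: step through a candidate plan and report whether the
    # goal is reached (stubbed to accept any non-empty plan).
    return len(plan) > 0

def gen_vlm_to_pddl(description, feedback=None):
    # GenVLM: turn the description (plus feedback from any failed
    # round) into PDDL domain and problem text (stubbed).
    return "(define (domain demo) ...)", "(define (problem p1) ...)"

def pddl_solve(domain, problem):
    # Classical PDDL solver (stubbed with a fixed plan).
    return ["(unstack B A)", "(put-down B)"]

def vlmfp_plan(image, max_rounds=5):
    description = sim_vlm_describe(image)
    feedback = None
    for _ in range(max_rounds):
        domain, problem = gen_vlm_to_pddl(description, feedback)
        plan = pddl_solve(domain, problem)
        if sim_vlm_simulate(image, plan):
            return plan  # solver and simulator agree: goal reached
        feedback = "plan failed in simulation"  # refine next round
    return None
```

The key design point is that the loop terminates only when the solver's plan and the simulator's rollout agree that the goal is achieved.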
“The generator and simulator work together to be able to reach the exact same result, which is an action simulation that achieves the goal,” Hao says.
Because GenVLM is a large generative AI model, it has seen many examples of PDDL during training and learned how this formal language can solve a wide variety of problems. This existing knowledge enables the model to generate accurate PDDL files.
A versatile approach
VLMFP generates two separate PDDL files. The first is a domain file that defines the environment, valid actions, and domain rules. It also produces a problem file that defines the initial states and the goal of the specific problem at hand.
“One advantage of PDDL is that the domain file is the same for all instances in that environment. This makes our framework good at generalizing to unseen instances under the same domain,” Hao explains.
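For readers unfamiliar with PDDL, here is a generic toy example (not from the paper) showing the split Hao describes: the domain file fixes the objects' possible actions and rules once, while each problem file supplies only the initial state and goal of a particular instance.

```pddl
;; Domain file: shared by every instance in this environment.
(define (domain gripper)
  (:predicates (at-robot ?room) (at-ball ?ball ?room) (holding ?ball))
  (:action move
    :parameters (?from ?to)
    :precondition (at-robot ?from)
    :effect (and (at-robot ?to) (not (at-robot ?from))))
  (:action pick
    :parameters (?ball ?room)
    :precondition (and (at-robot ?room) (at-ball ?ball ?room))
    :effect (and (holding ?ball) (not (at-ball ?ball ?room))))
  (:action drop
    :parameters (?ball ?room)
    :precondition (and (at-robot ?room) (holding ?ball))
    :effect (and (at-ball ?ball ?room) (not (holding ?ball)))))

;; Problem file: the initial state and goal for one instance.
(define (problem move-ball)
  (:domain gripper)
  (:objects ball1 roomA roomB)
  (:init (at-robot roomA) (at-ball ball1 roomA))
  (:goal (at-ball ball1 roomB)))
```

A classical solver reads both files and returns an action sequence, here pick, move, then drop; a new instance in the same environment needs only a new problem file.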
To enable the system to generalize effectively, the researchers had to carefully design just enough training data for SimVLM so the model learned to understand the problem and goal without memorizing patterns in the scenario. When tested, SimVLM successfully described the scenario, simulated actions, and detected whether the goal was reached in about 85 percent of experiments.
Overall, the VLMFP framework achieved a success rate of about 60 percent on six 2D planning tasks and more than 80 percent on two 3D tasks, including multirobot collaboration and robotic assembly. It also generated valid plans for more than 50 percent of scenarios it hadn't seen before, far outpacing the baseline methods.
“Our framework can generalize when the rules change in different situations. This gives our system the flexibility to solve many kinds of vision-based planning problems,” Fan adds.
In the future, the researchers want to enable VLMFP to handle more complex scenarios and explore ways to identify and mitigate hallucinations by the VLMs.
“In the long run, generative AI models could act as agents and make use of the right tools to solve much more complicated problems. But what does it mean to have the right tools, and how do we incorporate those tools? There is still a long way to go, but by bringing vision-based planning into the picture, this work is an important piece of the puzzle,” Fan says.
This work was funded, in part, by the MIT-IBM Watson AI Lab.
