Multiple AI models help robots execute complex plans more transparently

Your everyday to-do list is probably pretty straightforward: wash the dishes, buy groceries, and other minutiae. It’s unlikely you wrote out “pick up the first dirty dish” or “wash that plate with a sponge,” because each of these miniature steps within the chore feels intuitive. While we can routinely complete each step without much thought, a robot requires a complex plan with more detailed outlines.

MIT’s Improbable AI Lab, a group within the Computer Science and Artificial Intelligence Laboratory (CSAIL), has offered these machines a helping hand with a new multimodal framework: Compositional Foundation Models for Hierarchical Planning (HiP), which develops detailed, feasible plans with the expertise of three different foundation models. Like OpenAI’s GPT-4, the foundation model that ChatGPT and Bing Chat were built upon, these foundation models are trained on massive quantities of data for applications like generating images, translating text, and robotics.

Unlike RT2 and other multimodal models that are trained on paired vision, language, and action data, HiP uses three different foundation models, each trained on a different data modality. Each foundation model captures a different part of the decision-making process and then works with the others when it’s time to make decisions. HiP removes the need for access to paired vision, language, and action data, which is difficult to obtain, and it also makes the reasoning process more transparent.

What’s considered a daily chore for a human can be a robot’s “long-horizon goal” — an overarching objective that involves completing many smaller steps first — requiring sufficient data to plan, understand, and execute objectives. While computer vision researchers have attempted to build monolithic foundation models for this problem, pairing language, visual, and action data is expensive. Instead, HiP represents a different, multimodal recipe: a trio that cheaply incorporates linguistic, physical, and environmental intelligence into a robot.

“Foundation models do not have to be monolithic,” says NVIDIA AI researcher Jim Fan, who was not involved in the paper. “This work decomposes the complex task of embodied agent planning into three constituent models: a language reasoner, a visual world model, and an action planner. It makes a difficult decision-making problem more tractable and transparent.”

The team believes that their system could help these machines accomplish household chores, such as putting away a book or placing a bowl in the dishwasher. Moreover, HiP could assist with multistep construction and manufacturing tasks, like stacking and placing different materials in specific sequences.

Evaluating HiP

The CSAIL team tested HiP’s acuity on three manipulation tasks, where it outperformed comparable frameworks. The system reasoned by developing intelligent plans that adapt to new information.

First, the researchers asked it to stack different-colored blocks on each other and then place others nearby. The catch: Some of the correct colors weren’t present, so the robot had to place white blocks in a color bowl to paint them. HiP often adjusted to these changes accurately, especially compared to state-of-the-art task planning systems like Transformer BC and Action Diffuser, by adjusting its plans to stack and place each square as needed.

Another test: arranging objects such as candy and a hammer in a brown box while ignoring other items. Some of the objects it needed to move were dirty, so HiP adjusted its plans to place them in a cleaning box first, and then into the brown container. In a third demonstration, the bot was able to ignore unnecessary objects to complete kitchen sub-goals such as opening a microwave, clearing a kettle out of the way, and turning on a light. Some of the prompted steps had already been completed, so the robot adapted by skipping those directions.

A three-pronged hierarchy

HiP’s three-pronged planning process operates as a hierarchy, with the ability to pre-train each of its components on different sets of data, including information outside of robotics. At the bottom of that order is a large language model (LLM), which starts to ideate by capturing all the symbolic information needed and developing an abstract task plan. Applying the common sense knowledge it finds on the internet, the model breaks its objective into sub-goals. For example, “making a cup of tea” becomes “filling a pot with water,” “boiling the pot,” and the subsequent actions required.
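As a rough illustration of this first stage, the sketch below breaks a long-horizon goal into sub-goals with a text prompt. The `query_llm` helper and the prompt wording are hypothetical stand-ins for whichever pretrained language model is plugged in, not HiP’s actual interface.

```python
# Minimal sketch of the LLM planning stage. `query_llm` is a hypothetical
# stand-in for any text-completion model.

def propose_subgoals(goal: str, query_llm) -> list[str]:
    """Ask a language model to break a long-horizon goal into abstract sub-goals."""
    prompt = (
        f"Task: {goal}\n"
        "List the sub-goals needed to complete this task, one per line."
    )
    reply = query_llm(prompt)
    # Treat each non-empty line of the reply as one sub-goal.
    return [line.strip() for line in reply.splitlines() if line.strip()]

# Illustrative call and output:
# propose_subgoals("make a cup of tea", query_llm)
# -> ["fill a pot with water", "boil the pot", "steep the tea", "pour it into a cup"]
```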

“All we want to do is take existing pre-trained models and have them successfully interface with each other,” says Anurag Ajay, a PhD student in the MIT Department of Electrical Engineering and Computer Science (EECS) and a CSAIL affiliate. “Instead of pushing for one model to do everything, we combine multiple ones that leverage different modalities of web data. When used in tandem, they help with robotic decision-making and can potentially aid with tasks in homes, factories, and construction sites.”

These models also need some form of “eyes” to understand the environment they’re operating in and accurately execute each sub-goal. The team used a large video diffusion model to augment the initial planning completed by the LLM, which collects geometric and physical information about the world from footage on the internet. In turn, the video model generates an observation trajectory plan, refining the LLM’s outline to incorporate new physical knowledge.
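A minimal sketch of how this visual stage could be wired up is below. The `VideoDiffusionModel` class and its `sample` signature are assumptions for illustration, standing in for a pretrained text- and image-conditioned video model rather than HiP’s real API.

```python
import numpy as np

class VideoDiffusionModel:
    """Hypothetical wrapper around a pretrained, text- and image-conditioned video model."""

    def sample(self, frame: np.ndarray, subgoal: str, horizon: int) -> np.ndarray:
        # A real model would denoise `horizon` future frames here.
        raise NotImplementedError

def plan_observations(model: VideoDiffusionModel,
                      current_frame: np.ndarray,
                      subgoal: str,
                      horizon: int = 16) -> np.ndarray:
    """Predict a short trajectory of future observations for one sub-goal."""
    # Returned array has shape (horizon, H, W, 3): one predicted frame per step,
    # grounding the abstract sub-goal in what the scene should look like.
    return model.sample(current_frame, subgoal, horizon)
```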

This process, known as iterative refinement, allows HiP to reason about its ideas, taking in feedback at each stage to generate a more practical outline. The flow of feedback is similar to writing an article: an author may send a draft to an editor, and once those revisions are incorporated, the publisher reviews it for any last changes and finalizes it.
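The feedback loop itself can be sketched abstractly. The snippet below is a simplified illustration of the idea, assuming hypothetical `propose` and `score` callables in place of HiP’s actual sampling and feedback machinery: one level keeps proposing candidates until the model below judges one consistent enough.

```python
def refine(propose, score, threshold: float = 0.9, max_iters: int = 10):
    """Resample a candidate plan until the downstream model scores it highly enough."""
    best, best_score = None, float("-inf")
    for _ in range(max_iters):
        candidate = propose()      # e.g. a sub-goal list or an observation trajectory
        value = score(candidate)   # feedback from the next model in the hierarchy
        if value > best_score:
            best, best_score = candidate, value
        if value >= threshold:
            break                  # consistent enough; stop refining
    return best
```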

In this case, the top of the hierarchy is an egocentric action model, which uses a sequence of first-person images to infer which actions should take place based on the robot’s surroundings. During this stage, the observation plan from the video model is mapped over the space visible to the robot, helping the machine decide how to execute each task within the long-horizon goal. If a robot uses HiP to make tea, this means it will have mapped out exactly where the pot, sink, and other key visual elements are, and can begin completing each sub-goal.
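Conceptually, this last stage resembles an inverse-dynamics step: infer the action that carries the robot from one predicted observation to the next. The sketch below assumes a hypothetical `action_model` with an `infer` method and is meant only to show the data flow, not HiP’s real interface.

```python
def plan_actions(action_model, predicted_frames):
    """Infer one low-level action per pair of consecutive predicted observations."""
    actions = []
    for prev_frame, next_frame in zip(predicted_frames[:-1], predicted_frames[1:]):
        # The egocentric action model maps (current view, desired next view) -> action.
        actions.append(action_model.infer(prev_frame, next_frame))
    return actions
```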

Still, the multimodal work is limited by the lack of high-quality video foundation models. Once available, they could interface with HiP’s small-scale video models to further enhance visual sequence prediction and robot action generation. A higher-quality version would also reduce the current data requirements of the video models.

That being said, the CSAIL team’s approach only used a tiny bit of data overall. Furthermore, HiP was cheap to train and demonstrated the potential of using readily available foundation models to complete long-horizon tasks. “What Anurag has demonstrated is a proof-of-concept of how we can take models trained on separate tasks and data modalities and combine them into models for robotic planning. In the future, HiP could be augmented with pre-trained models that can process touch and sound to make better plans,” says senior author Pulkit Agrawal, MIT assistant professor in EECS and director of the Improbable AI Lab. The group is also interested in applying HiP to solving real-world long-horizon tasks in robotics.

Ajay and Agrawal are lead authors on a paper describing the work. They are joined by MIT professors and CSAIL principal investigators Tommi Jaakkola, Joshua Tenenbaum, and Leslie Pack Kaelbling; CSAIL research affiliate and MIT-IBM AI Lab research manager Akash Srivastava; graduate students Seungwook Han and Yilun Du ’19; former postdoc Abhishek Gupta, who is now an assistant professor at the University of Washington; and former graduate student Shuang Li PhD ’23.

The team’s work was supported, in part, by the National Science Foundation, the U.S. Defense Advanced Research Projects Agency, the U.S. Army Research Office, the U.S. Office of Naval Research Multidisciplinary University Research Initiatives, and the MIT-IBM Watson AI Lab. Their findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS).
