
Improving AI’s ability to follow complex human instructions


A paper on vision and language navigation by a team of researchers including CDS PhD student Aishwarya Kamath was accepted at CVPR 2023

CDS PhD student Aishwarya Kamath

Training robots to follow human instructions in natural language is a major long-term challenge in artificial intelligence research. To explore this question, the field of Vision-and-Language Navigation (VLN) has developed to train models that navigate photorealistic environments. Due to a lack of diversity in training environments and scarce human instruction data, these models still face limitations. In a recent paper, “A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning,” the authors present a new dataset two orders of magnitude larger than existing human-annotated datasets, with VLN instructions approaching the quality of human-written instructions.

Led by CDS PhD student Aishwarya Kamath, the paper was accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), which will be held at the Vancouver Convention Center from June 18th through the 22nd. Along with Aishwarya, the team consists of Senior Research Scientist at Google Research Peter Anderson, Software Engineer at Google AI Language Su Wang, PhD student at Carnegie Mellon University Jing Yu Koh, Software Engineer at Google Research Alexander Ku, Research Scientist at Google Research Austin Waters, Research Scientist at Apple AI/ML Yinfei Yang, Research Scientist at Google Research Jason Baldridge, and Software Engineer at Google AI Zarana Parekh.

While pre-training transformer models on generic image-text data has been explored with limited success, in this work the authors explore pre-training on diverse, in-domain instruction-following data. The researchers use more than 500 indoor environments captured in 360-degree panoramas to develop a novel dataset. They construct navigation trajectories using panoramas from the previously unexplored Gibson dataset and generate visually grounded instructions for each trajectory using Marky, a multilingual navigation instruction generator. By using a diverse set of environments and augmenting them with novel viewpoints, the authors train a pure imitation learning agent that scales efficiently, making it possible to effectively utilize the created dataset. The resulting agent outperformed all existing reinforcement learning agents on the challenging Room-across-Room (RxR) VLN benchmark.
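To make the imitation learning idea concrete, below is a minimal behavioral-cloning sketch: a policy predicts the next navigation action at each step of an expert (synthetic) trajectory and is trained with a cross-entropy loss against the expert's actions. This is not the authors' code; the module names, feature sizes, action set, and toy data are hypothetical placeholders standing in for the Marky-generated instructions and Gibson panorama features described above.

```python
# Hypothetical sketch of imitation learning (behavioral cloning) for VLN.
# All names, dimensions, and data here are illustrative assumptions.
import torch
import torch.nn as nn

NUM_ACTIONS = 6     # assumed action set, e.g. forward, turns, look up/down, stop
FEATURE_DIM = 512   # assumed size of fused instruction + panorama embeddings


class NavigationPolicy(nn.Module):
    """Predicts the next action from fused instruction and view features."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=FEATURE_DIM, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(FEATURE_DIM, NUM_ACTIONS)

    def forward(self, step_features: torch.Tensor) -> torch.Tensor:
        # step_features: (batch, steps, FEATURE_DIM), one embedding per trajectory step
        encoded = self.encoder(step_features)
        return self.action_head(encoded)  # (batch, steps, NUM_ACTIONS)


def imitation_step(policy, optimizer, step_features, expert_actions):
    """One behavioral-cloning update: match the expert's action at every step."""
    logits = policy(step_features)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, NUM_ACTIONS), expert_actions.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    policy = NavigationPolicy()
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)
    # Toy batch standing in for synthetic trajectories and their expert actions.
    features = torch.randn(4, 10, FEATURE_DIM)
    actions = torch.randint(0, NUM_ACTIONS, (4, 10))
    print("loss:", imitation_step(policy, optimizer, features, actions))
```

The point of the sketch is that nothing in the training loop requires reinforcement learning rewards: with enough diverse in-domain trajectories, supervised imitation alone is what the paper reports scaling effectively.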

“This project was of interest to me because I was curious to know whether generic architectures and training strategies can achieve competitive performance on the VLN task, opening up the possibility of having one model for all vision and language tasks, including embodied ones like VLN,” said Kamath. “We found that scaling up in-domain data is crucial, and coupling this with an imitation learning agent can achieve state-of-the-art results without any reinforcement learning.”

The study opens new avenues for improving AI’s ability to follow complex human instructions. “This result paves a new path towards improving instruction-following agents, emphasizing large-scale imitation learning with generic architectures, along with a focus on developing synthetic instruction generation capabilities, which are shown to directly improve instruction-following performance,” write the authors.

By Meryl Phair
