In the classic cartoon “The Jetsons,” Rosie the robotic maid seamlessly switches from vacuuming the house to cooking dinner to taking out the trash. But in real life, training a general-purpose robot remains a major challenge.
Typically, engineers collect data that are specific to a certain robot and task, which they use to train the robot in a controlled environment. However, gathering these data is expensive and time-consuming, and the robot will likely struggle to adapt to environments or tasks it hasn’t seen before.
To train better general-purpose robots, MIT researchers developed a versatile technique that combines a huge amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks.
Their method involves aligning data from varied domains, like simulations and real robots, and multiple modalities, including vision sensors and robotic arm position encoders, into a shared “language” that a generative AI model can process.
By combining such an enormous amount of data, this approach can be used to train a robot to perform a variety of tasks without the need to start training it from scratch each time.
This method could be faster and less expensive than traditional techniques because it requires far fewer task-specific data. In addition, it outperformed training from scratch by more than 20 percent in simulation and real-world experiments.
“In robotics, people often claim that we don’t have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware. Our work shows how you’d be able to train a robot with all of them put together,” says Lirui Wang, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this technique.
Wang’s co-authors include fellow EECS graduate student Jialiang Zhao; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Neural Information Processing Systems.
Inspired by LLMs
A robotic “policy” takes in sensor observations, like camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tells a robot how and where to move.
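To make that interface concrete, here is a minimal, hypothetical sketch in Python: the class name, the 7-dimensional action, and the random stub are illustrative assumptions, not anything from the paper. A trained policy would replace the stub with a neural network.

```python
import numpy as np

class RandomPolicy:
    """Hypothetical stand-in for a policy: observations in, action out."""

    def __init__(self, action_dim: int = 7, seed: int = 0):
        self.action_dim = action_dim           # e.g., 7 joint velocity commands
        self.rng = np.random.default_rng(seed)

    def act(self, image: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        # A real policy would run a learned model on the camera frame and
        # joint readings; this stub just returns a random motor command.
        return self.rng.uniform(-1.0, 1.0, self.action_dim)

# One control step: observe the world, then command the arm.
policy = RandomPolicy()
image = np.zeros((224, 224, 3), dtype=np.uint8)  # camera frame
proprio = np.zeros(7)                            # joint positions/velocities
action = policy.act(image, proprio)
print(action.shape)  # (7,)
```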
Policies are typically trained using imitation learning, meaning a human demonstrates actions or teleoperates a robot to generate data, which are fed into an AI model that learns the policy. Because this technique uses a small amount of task-specific data, robots often fail when their environment or task changes.
To develop a better approach, Wang and his collaborators drew inspiration from large language models like GPT-4.
These models are pretrained using an enormous amount of diverse language data and then fine-tuned by feeding them a small amount of task-specific data. Pretraining on so much data helps the models adapt to perform well on a variety of tasks.
“In the language domain, the data are all just sentences. In robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, we need a different architecture,” he says.
Robotic data take many forms, from camera images to language instructions to depth maps. At the same time, each robot is mechanically unique, with a different number and orientation of arms, grippers, and sensors. Plus, the environments where data are collected vary widely.
The MIT researchers developed a new architecture called Heterogeneous Pretrained Transformers (HPT) that unifies data from these varied modalities and domains.
They put a machine-learning model known as a transformer at the center of their architecture, which processes vision and proprioception inputs. A transformer is the same type of model that forms the backbone of large language models.
The researchers align data from vision and proprioception into the same type of input, called a token, which the transformer can process. Each input is represented with the same fixed number of tokens.
Then the transformer maps all inputs into one shared space, growing into a huge, pretrained model as it processes and learns from more data. The larger the transformer becomes, the better it will perform.
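This tokenize-then-share idea can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption rather than the paper’s actual code: the module names, the 16-tokens-per-modality budget, and the 256-dimensional shared space are invented for the example. The point is that each modality, whatever its raw form, is compressed to the same fixed number of tokens before one shared transformer processes them all together.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the paper.
NUM_TOKENS, DIM = 16, 256

class ImageStem(nn.Module):
    """Turns a camera frame into a fixed number of tokens."""
    def __init__(self):
        super().__init__()
        # 16x16 patches of a 224x224 image -> a 14x14 grid of patch embeddings.
        self.patchify = nn.Conv2d(3, DIM, kernel_size=16, stride=16)
        # Cross-attention from NUM_TOKENS learned queries compresses the
        # patch grid down to a fixed-size token set.
        self.queries = nn.Parameter(torch.randn(NUM_TOKENS, DIM))
        self.attn = nn.MultiheadAttention(DIM, num_heads=8, batch_first=True)

    def forward(self, image):                                      # (B, 3, 224, 224)
        patches = self.patchify(image).flatten(2).transpose(1, 2)  # (B, 196, DIM)
        q = self.queries.expand(image.size(0), -1, -1)
        tokens, _ = self.attn(q, patches, patches)                 # (B, NUM_TOKENS, DIM)
        return tokens

class ProprioStem(nn.Module):
    """Turns joint readings into the same fixed number of tokens."""
    def __init__(self, proprio_dim: int = 7):
        super().__init__()
        self.proj = nn.Linear(proprio_dim, NUM_TOKENS * DIM)

    def forward(self, proprio):                                    # (B, proprio_dim)
        return self.proj(proprio).view(-1, NUM_TOKENS, DIM)

# Shared trunk: one transformer processes both modalities' tokens together.
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True),
    num_layers=4,
)

img_tokens = ImageStem()(torch.randn(2, 3, 224, 224))
prop_tokens = ProprioStem()(torch.randn(2, 7))
shared = trunk(torch.cat([img_tokens, prop_tokens], dim=1))        # (2, 32, 256)
```

Because both stems emit exactly NUM_TOKENS tokens, neither modality dominates the trunk’s input, which is the design choice Wang describes below.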
A user only needs to feed HPT a small amount of data on their robot’s design, setup, and the task they want it to perform. Then HPT transfers the knowledge the transformer gained during pretraining to learn the new task.
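Continuing the hypothetical sketch above, one way such a transfer could look is freezing the pretrained shared trunk and training only a small, task-specific output head on the user’s data; the head design and the dummy training step are assumptions for illustration only.

```python
# Reuses trunk, DIM, img_tokens, and prop_tokens from the sketch above.
for p in trunk.parameters():
    p.requires_grad = False  # keep the pretrained shared trunk fixed

# Hypothetical action decoder for a 7-DoF arm.
action_head = nn.Sequential(nn.LayerNorm(DIM), nn.Linear(DIM, 7))
optimizer = torch.optim.Adam(action_head.parameters(), lr=1e-4)

features = trunk(torch.cat([img_tokens, prop_tokens], dim=1))
action = action_head(features.mean(dim=1))            # pool tokens -> one command
loss = nn.functional.mse_loss(action, torch.zeros(2, 7))  # dummy demo targets
loss.backward()
optimizer.step()
```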
Enabling dexterous motions
One of the biggest challenges of developing HPT was building the massive dataset to pretrain the transformer, which included 52 datasets with more than 200,000 robot trajectories in four categories, including human demo videos and simulation.
The researchers also needed to develop an efficient way to turn raw proprioception signals from an array of sensors into data the transformer could handle.
“Proprioception is key to enable a lot of dexterous motions. Because the number of tokens in our architecture is always the same, we place the same importance on proprioception and vision,” Wang explains.
When they tested HPT, it improved robot performance by more than 20 percent on simulation and real-world tasks, compared with training from scratch each time. Even when the task was very different from the pretraining data, HPT still improved performance.
“This paper provides a novel approach to training a single policy across multiple robot embodiments. This enables training across diverse datasets, allowing robot learning methods to significantly scale up the size of the datasets they can train on. It also lets the model quickly adapt to new robot embodiments, which is important as new robot designs are continuously being produced,” says David Held, an associate professor at the Carnegie Mellon University Robotics Institute, who was not involved with this work.
In the future, the researchers want to study how data diversity could boost the performance of HPT. They also want to enhance HPT so it can process unlabeled data, like GPT-4 and other large language models.
“Our dream is to have a universal robot brain that you could download and use for your robot without any training at all. While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models,” he says.
This work was funded, in part, by the Amazon Greater Boston Tech Initiative and the Toyota Research Institute.