In the current AI zeitgeist, sequence models have skyrocketed in popularity for their ability to analyze data and predict what to do next. For instance, you’ve likely used next-token prediction models like ChatGPT, which predict each word (token) in a sequence to form answers to users’ queries. There are also full-sequence diffusion models like Sora, which convert words into dazzling, realistic visuals by successively “denoising” an entire video sequence.
Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a straightforward change to the diffusion training scheme that makes this sequence denoising considerably more flexible.
When applied to fields like computer vision and robotics, next-token and full-sequence diffusion models have capability trade-offs. Next-token models can spit out sequences that vary in length. However, they make these generations while being unaware of desirable states in the far future, such as steering sequence generation toward a goal 10 tokens away, and thus require additional mechanisms for long-horizon (long-term) planning. Diffusion models can perform such future-conditioned sampling, but lack the ability of next-token models to generate variable-length sequences.
Researchers from CSAIL wanted to combine the strengths of both models, so they created a sequence model training technique called “Diffusion Forcing.” The name comes from “Teacher Forcing,” the conventional training scheme that breaks down full sequence generation into the smaller, easier steps of next-token generation (much like a good teacher simplifying a complex concept).
Diffusion Forcing found common ground between diffusion models and teacher forcing: they both use training schemes that involve predicting masked (noisy) tokens from unmasked ones. In the case of diffusion models, they gradually add noise to data, which can be viewed as fractional masking. The MIT researchers’ Diffusion Forcing method trains neural networks to cleanse a collection of tokens, removing different amounts of noise within each one while simultaneously predicting the next few tokens. The result: a flexible, reliable sequence model that produced higher-quality synthetic videos and more precise decision-making for robots and AI agents.
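To make that training recipe concrete, here is a minimal, illustrative PyTorch sketch, not the authors’ implementation: the TokenDenoiser class, the GRU backbone, and the linear noise schedule are assumptions for illustration. Each token in a window is corrupted with its own independently sampled noise level, and the model is trained to recover the clean tokens.

```python
# Illustrative training sketch (not the authors' code): every token in a window
# gets its own independently sampled noise level, so the model learns to recover
# clean tokens under "fractional masking." Class names and the noise schedule
# are assumptions for illustration only.
import torch
import torch.nn as nn

T, D, K = 16, 32, 1000   # sequence length, token dimension, number of noise levels

class TokenDenoiser(nn.Module):
    """Hypothetical causal sequence model that predicts the clean token at each
    position, given noisy tokens and their per-token noise levels."""
    def __init__(self, d, k):
        super().__init__()
        self.level_emb = nn.Embedding(k, d)          # embed each token's noise level
        self.rnn = nn.GRU(d, d, batch_first=True)    # stand-in sequence backbone
        self.head = nn.Linear(d, d)

    def forward(self, noisy_tokens, levels):
        h, _ = self.rnn(noisy_tokens + self.level_emb(levels))
        return self.head(h)

def signal_fraction(levels):
    # Toy linear schedule: how much clean signal survives at a given noise level.
    return 1.0 - levels.float() / K

model = TokenDenoiser(D, K)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    clean = torch.randn(8, T, D)              # stand-in for real token sequences
    levels = torch.randint(0, K, (8, T))      # independent noise level per token
    a = signal_fraction(levels).unsqueeze(-1)
    noisy = a.sqrt() * clean + (1 - a).sqrt() * torch.randn_like(clean)
    loss = ((model(noisy, levels) - clean) ** 2).mean()   # recover the clean tokens
    opt.zero_grad(); loss.backward(); opt.step()
```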
By sorting through noisy data and reliably predicting the next steps in a task, Diffusion Forcing can help a robot ignore visual distractions to complete manipulation tasks. It can also generate stable and consistent video sequences and even guide an AI agent through digital mazes. This method could potentially enable household and factory robots to generalize to new tasks and improve AI-generated entertainment.
“Sequence models aim to condition on the known past and predict the unknown future, a type of binary masking. However, masking doesn’t need to be binary,” says lead author, MIT electrical engineering and computer science (EECS) PhD student, and CSAIL member Boyuan Chen. “With Diffusion Forcing, we add different levels of noise to each token, effectively serving as a type of fractional masking. At test time, our system can ‘unmask’ a collection of tokens and diffuse a sequence in the near future at a lower noise level. It knows what to trust within its data to overcome out-of-distribution inputs.”
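A correspondingly minimal sketch of that test-time behavior, again with assumed names and a toy schedule rather than the paper’s actual sampler: near-future positions are held at lower noise levels than far-future ones, and a placeholder denoiser refines each position only until it reaches its target level.

```python
# Illustrative sampling sketch only: near-future tokens are "unmasked" (held at a
# low noise level) while far-future tokens stay noisier. The `denoiser` below is a
# trivial stand-in for a trained model such as the one sketched above.
import torch

K, T, D = 1000, 8, 32    # noise levels, window length, token dimension

def denoiser(noisy, levels):
    # Placeholder for a trained per-token denoiser; returns a crude clean-token guess.
    return noisy * (1.0 - levels.float() / K).unsqueeze(-1)

# Each position's target noise level grows with its distance into the future.
target = torch.linspace(0, K - 1, T).round().long().unsqueeze(0)   # shape (1, T)

tokens = torch.randn(1, T, D)            # start the whole window as pure noise
levels = torch.full((1, T), K - 1)       # every position begins fully "masked"
while (levels > target).any():
    estimate = denoiser(tokens, levels)
    levels = torch.maximum(levels - 50, target)   # step down, but not past each target
    a = (1.0 - levels.float() / K).unsqueeze(-1)
    tokens = a.sqrt() * estimate + (1 - a).sqrt() * torch.randn_like(estimate)
```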
In several experiments, Diffusion Forcing thrived at ignoring misleading data to execute tasks while anticipating future actions.
When implemented into a robotic arm, for example, Diffusion Forcing helped swap two toy fruits across three circular mats, a minimal example of a family of long-horizon tasks that require memory. The researchers trained the robot by controlling it from a distance (or teleoperating it) in virtual reality. The robot was trained to mimic the user’s movements from its camera. Despite starting from random positions and seeing distractions like a shopping bag blocking the markers, it placed the objects into its target spots.
To generate videos, the researchers trained Diffusion Forcing on “Minecraft” gameplay and colorful digital environments created within Google’s DeepMind Lab Simulator. When given a single frame of footage, the method produced more stable, higher-resolution videos than comparable baselines like a Sora-like full-sequence diffusion model and ChatGPT-like next-token models. Those approaches created videos that appeared inconsistent, with the latter sometimes failing to generate working video past just 72 frames.
Diffusion Forcing not only generates fancy videos; it can also serve as a motion planner that steers toward desired outcomes or rewards. Thanks to its flexibility, Diffusion Forcing can uniquely generate plans with varying horizons, perform tree search, and incorporate the intuition that the distant future is more uncertain than the near future. In the task of solving a 2D maze, Diffusion Forcing outperformed six baselines by generating faster plans leading to the goal location, indicating that it could be an effective planner for robots in the future.
Across each demo, Diffusion Forcing acted as a full-sequence model, a next-token prediction model, or both. According to Chen, this versatile approach could potentially serve as a powerful backbone for a “world model,” an AI system that can simulate the dynamics of the world by training on billions of internet videos. This could allow robots to perform novel tasks by imagining what they need to do based on their surroundings. For example, if you asked a robot to open a door without it having been trained to do so, the model could produce a video showing the machine how to do it.
The team is currently looking to scale up their method to larger datasets and the latest transformer models to improve performance. They intend to expand their work to building a ChatGPT-like robot brain that helps robots perform tasks in new environments without human demonstration.
“With Diffusion Forcing, we are taking a step toward bringing video generation and robotics closer together,” says senior author Vincent Sitzmann, MIT assistant professor and member of CSAIL, where he leads the Scene Representation group. “Ultimately, we hope that we can use all the knowledge stored in videos on the internet to enable robots to help in everyday life. Many more exciting research challenges remain, like how robots can learn to imitate humans by watching them even when their own bodies are so different from ours!”
Chen and Sitzmann wrote the paper alongside recent MIT visiting researcher Diego Martí Monsó, and CSAIL affiliates: Yilun Du, an EECS graduate student; Max Simchowitz, former postdoc and incoming Carnegie Mellon University assistant professor; and Russ Tedrake, the Toyota Professor of EECS, Aeronautics and Astronautics, and Mechanical Engineering at MIT, vice president of robotics research at the Toyota Research Institute, and CSAIL member. Their work was supported, in part, by the U.S. National Science Foundation, the Singapore Defence Science and Technology Agency, Intelligence Advanced Research Projects Activity via the U.S. Department of the Interior, and the Amazon Science Hub. They will present their research at NeurIPS in December.