R²D²: Improving Robot Manipulation with Simulation and Language Models

Robot manipulation systems struggle with changing objects, lighting, and contact dynamics when they move into dynamic real-world environments. On top of this, gaps between simulation and reality, along with non-optimized grippers and tools, often limit how reliably robots can generalize, execute long-horizon tasks, and achieve human-level dexterity across diverse tasks.

This edition of NVIDIA Robotics Research and Development Digest (R²D²) explores novel approaches to improving robot manipulation skills. In this blog, we'll discuss three research efforts that use reasoning LLMs, sim-and-real co-training, and VLMs for designing manipulation tools: ThinkAct, sim-and-real policy co-training, and RobotSmith.

We'll also cover how robot manipulation can be improved using data augmentation and other recipes from the Cosmos Cookbook. This cookbook is an open-source resource that features examples of real-world applications of NVIDIA Cosmos for robotics and autonomous driving.

Improving robot reasoning and motion execution with ThinkAct

In robotics, vision-language-action (VLA) models generate robot actions from multimodal instructions, like vision and natural language. A strong VLA should be able to understand and output complex, multi-step actions in dynamic environments. Current approaches to robot manipulation train end-to-end VLAs without an explicit reasoning step. This makes it difficult for VLAs to plan long-horizon tasks and to adapt to varied tasks and environments.

ThinkAct reduces this gap by integrating high-level reasoning with low-level action execution in a dual-system framework. This "thinking before acting" framework is implemented via reinforced visual latent planning.

First, a multimodal large language model (MLLM) is trained to generate reasoning plans for a robot to follow. These plans are created using reinforcement learning, where visual rewards encourage the MLLM to make plans that result in goal completion by following physically realistic trajectories. To do this, ThinkAct uses human and robot videos to ground its reasoning in visual observations. Training in this manner ensures that the robot's planning is not only theoretically correct but also physically feasible based on visual feedback. This is the "Think" part.
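
As a rough illustration, the snippet below sketches how an action-aligned visual reward might combine goal completion with trajectory plausibility. This is a minimal sketch under assumed tensor shapes; `visual_plan_reward` and its weights are hypothetical names, not ThinkAct's actual reward code.

```python
import torch

def visual_plan_reward(pred_traj, demo_traj, goal_reached,
                       w_goal=1.0, w_traj=0.5):
    """Hypothetical action-aligned visual reward (illustrative only).

    pred_traj:    (T, 2) end-effector keypoints projected from the plan
    demo_traj:    (T, 2) keypoints extracted from human/robot videos
    goal_reached: 1.0 if the rollout completes the goal, else 0.0
    """
    # Trajectory term: planned motion should stay close to physically
    # realistic trajectories observed in video demonstrations.
    traj_reward = -torch.norm(pred_traj - demo_traj, dim=-1).mean()
    # Goal term: did the plan actually finish the task?
    return w_goal * goal_reached + w_traj * traj_reward
```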

Now onto the "Act" part. Intermediate steps in a reasoning plan are compressed into a compact latent trajectory. This representation captures the essential intent and context of the plan. The latent trajectory then guides a separate motion model, enabling the robot to execute actions in diverse environments. In this way, high-level reasoning informs and improves low-level robot actions in real-world scenarios.
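
The dual-system split can be pictured with a toy module like the one below, where pooled reasoning tokens become a compact latent that conditions a separate low-level policy. All dimensions and module names here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentPlanActor(nn.Module):
    """Minimal sketch of latent-plan-conditioned acting (hypothetical)."""

    def __init__(self, plan_dim=4096, latent_dim=64, obs_dim=512, act_dim=7):
        super().__init__()
        # Compress reasoning-plan tokens into a compact latent trajectory.
        self.compress = nn.Linear(plan_dim, latent_dim)
        # Separate low-level motion model guided by the latent plan.
        self.policy = nn.Sequential(
            nn.Linear(latent_dim + obs_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, plan_tokens, obs_feat):
        # plan_tokens: (B, T, plan_dim) from the reasoning MLLM;
        # pooling keeps the essential intent and context of the plan.
        z = self.compress(plan_tokens.mean(dim=1))
        # obs_feat: (B, obs_dim) features of the current visual observation.
        return self.policy(torch.cat([z, obs_feat], dim=-1))
```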

Figure 1. Overview of ThinkAct's "thinking before acting" framework: action-aligned visual feedback and an LLM-based thinking module (few-shot adaptation, long-horizon planning, self-correction) guide robot acting.

ThinkAct has been tested on robot manipulation and embodied reasoning benchmarks. It successfully performs few-shot deployment, long-horizon manipulation, and self-correction in embodied AI tasks.

Figure 2. Visualization of a long-horizon manipulation task (Simpler-Google): the robot reasons about moving a soda can near an apple, flowing from input to reasoning, visual trajectory, and action execution.

Co-training policies with sim-and-real data

Training robots to perform manipulation tasks requires collecting data across diverse tasks, environments, and object configurations. A common way of doing that is behavior cloning, where expert demonstrations are captured in the real world. This sounds good in theory, but it's expensive and doesn't scale in practice. Real-world data collection requires human operators to manually generate demonstrations or monitor robots, which is slow and limited by the availability of robot hardware.

One answer is to gather demonstrations in simulation, which can be automated and parallelized to make data collection fast and simple. However, policies trained on simulation data don't always transfer well to the real world. This is the sim-to-real gap, which arises because simulations cannot perfectly replicate the complexities of real-world physics, dynamics, noise, and feedback.

The sim-and-real policy co-training work bridges this gap by using both simulation data and a few real-world demonstrations to learn generalizable manipulation policies. It is a unified sim-and-real co-training framework that learns a shared latent space where observations from simulation and the real world are aligned. It builds on the work presented in sim-and-real co-training and uses a better representation space for alignment, one that also captures action-related information. The main idea is to align observations and their corresponding actions, so that the policy learns behaviors that work in both simulated and real settings.

These representations are learned via a technique called optimal transport (OT). OT helps policies detect similar patterns in simulation and real-world data so that the information needed for selecting actions stays the same, regardless of whether the input is simulated or real. There is usually far more simulated data than real data, so this imbalance is handled by extending to an unbalanced OT (UOT) framework. UOT uses a sampling method that keeps training effective even when the datasets are different sizes.
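
To make the alignment idea concrete, here is a minimal sketch of an entropic unbalanced Sinkhorn loss between batches of sim and real policy features. The function name and hyperparameters are our illustrative assumptions, not the paper's implementation, and it assumes reasonably scaled (e.g., normalized) embeddings.

```python
import torch

def uot_alignment_loss(sim_feats, real_feats, eps=0.05, tau=1.0, n_iters=50):
    """Illustrative unbalanced OT alignment between sim and real batches."""
    # Pairwise squared Euclidean cost between sim and real embeddings.
    cost = torch.cdist(sim_feats, real_feats, p=2) ** 2
    n, m = cost.shape
    # Uniform marginals; the real batch is typically much smaller.
    a = torch.full((n,), 1.0 / n, device=cost.device)
    b = torch.full((m,), 1.0 / m, device=cost.device)
    # Entropic kernel and damped Sinkhorn updates. The KL relaxation
    # (weight tau) softens the marginal constraints, tolerating the
    # sim/real data imbalance instead of forcing exact mass matching.
    K = torch.exp(-cost / eps)
    damp = tau / (tau + eps)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):
        u = (a / (K @ v)) ** damp
        v = (b / (K.T @ u)) ** damp
    plan = u[:, None] * K * v[None, :]
    # Transport cost under the (approximate) optimal plan; minimizing it
    # pulls action-relevant sim and real features toward each other.
    return (plan * cost).sum()
```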

Figure 3. Overview of sim-and-real policy co-training using OT: large, diverse simulation data is aligned with sparse real-world data to learn a robust, shared policy for deployment and generalization.

Policies trained using this framework successfully generalize to real-world scenarios, even when those scenarios appeared only in the simulated part of the training data. Both sim-to-sim and sim-to-real transfer were evaluated across robot manipulation tasks like lifting, cube stacking, and placing a box in a bin.

Figure 4. Using sim-and-real co-training, the policy learns long-horizon tasks, like sorting objects into a closed drawer, from as few as 25 demonstrations.

Designing tools for manipulation with RobotSmith

Most robot manipulation tasks involve using different tools and objects. Tool use is a necessary capability for robots to interact with their environments and perform complex actions. The problem is that tools designed for humans are difficult for robots to handle due to their varied and complex form factors. Current approaches to robot tool design use predefined templates that aren't customizable, or 3D generation methods that aren't optimized for this purpose.

RobotSmith solves this challenge by providing an automated tool design framework that uses vision-language models (VLMs). VLMs are good at reasoning about 3D space and physical interactions, and at understanding what actions a robot can perform with different objects. These key capabilities make VLMs very useful for effective tool design.

RobotSmith integrates this prior knowledge from VLMs with a joint optimization process in simulation to generate task-specific tools. The three core components are:

  1. Critic Tool Designer: Two VLM agents collaborate to generate candidate tool geometries.
  2. Tool Use Planner: Generates a manipulation trajectory based on the designed tool and scene. Candidate trajectories and grasps are executed and evaluated in simulation.
  3. Joint Optimizer: Tool geometry and trajectory parameters are jointly fine-tuned in simulation to maximize performance. This is important for eliminating suboptimal tool-and-trajectory pairs that would result in failed tasks.

In this way, RobotSmith generates diverse tool designs for tasks like pushing, scooping, or enclosing; the skeleton below sketches how the three stages fit together.
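
The following is a deliberately skeletal sketch of that loop. The helper functions are stubs standing in for VLM calls and physics simulation; none of these names come from the released system.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    tool_geometry: dict                             # parametric shape from the VLM agents
    trajectory: list = field(default_factory=list)  # grasp + tool-use waypoints
    score: float = 0.0                              # task success metric from simulation

def design_tools(task, scene, n):              # stage 1: Critic Tool Designer
    raise NotImplementedError("two VLM agents propose and critique geometries")

def plan_tool_use(candidate, scene):           # stage 2: Tool Use Planner
    raise NotImplementedError("sample grasps/trajectories, evaluate in sim")

def joint_optimize(candidate, scene, steps):   # stage 3: Joint Optimizer
    raise NotImplementedError("co-tune geometry and trajectory parameters")

def robotsmith(task, scene, n_candidates=4):
    candidates = design_tools(task, scene, n_candidates)
    for c in candidates:
        plan_tool_use(c, scene)                # fills c.trajectory and c.score
        joint_optimize(c, scene, steps=20)     # refine geometry + trajectory
    # Keep the tool-and-trajectory pair that performs best in simulation.
    return max(candidates, key=lambda c: c.score)
```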

Figure 5. RobotSmith iterates through tool designs in simulation, identifies an effective design, and generates a trajectory using the designed tool to achieve the user's task.

RobotSmith was evaluated in simulation and on real-world tasks. Find the complete list of experiments and results in the paper. One real-world test was making a pancake, for which the framework designed and used distinct tools for each step, like flattening, scooping, and spreading dough. This demonstrated the framework's ability to successfully perform long-horizon tasks.

Figure 6. RobotSmith designs and uses tools optimized for each subtask in a long-horizon manipulation scenario, shown side by side in real and sim: flatten dough, scoop sauce, spread sauce, and sprinkle sesame, with the before/after baking result.

Bridging the sim-to-real gap via the NVIDIA Cosmos Cookbook

We talked about the sim-to-real gap earlier in this blog and discussed how synthetic data can be used for training robot policies. Realistic-looking and diverse synthetic datasets result in robust policies that transfer well to the real world. NVIDIA Cosmos open world foundation models (WFMs), specifically Cosmos Transfer, can be used to scale up synthetic datasets by generating photorealistic, diverse data from a single simulation. Find the complete workflow in the Robotics Domain Adaptation Gallery in the cookbook.

In addition to this workflow, the NVIDIA Cosmos Cookbook offers step-by-step recipes and post-training scripts to quickly build, customize, and deploy Cosmos WFMs for robotics, autonomous, and agentic systems. It covers the following examples and concepts in depth:

  • Quick-start inference examples to get up and running.
  • Advanced post-training workflows for domain-specific fine-tuning.
  • Proven recipes for scalable, production-ready deployments.
  • Core concepts covering fundamental topics, techniques, architectural patterns, and tool documentation.

The Cosmos Cookbook is a resource from the physical AI community for sharing practical knowledge about Cosmos WFMs. We welcome contributions including workflows, recipes, best practices, and domain-specific adaptations on GitHub.

Getting started

In this blog, we discussed new workflows for improving robot manipulation skills. We showed how ThinkAct uses a "thinking before acting" framework to reason about and execute robot actions. Next, we talked about how co-training on simulation and real data leads to generalizable manipulation policies. We shared how RobotSmith generates robotic tool designs for the optimized tool use that complex tasks require. Finally, we saw how the Cosmos Cookbook provides examples and a shared place for physical AI projects using Cosmos models.

Check out the following resources to learn more about the work discussed in this blog:

ThinkAct, Generalizable Domain Adaptation, and RobotSmith, among many more papers from NVIDIA research teams, were accepted at NeurIPS 2025.

This post is part of our NVIDIA Robotics Research and Development Digest (R²D²), which gives developers deeper insight into the latest breakthroughs from NVIDIA Research across physical AI and robotics applications.

Stay up-to-date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and the developer forums. To begin your robotics journey, enroll in the free NVIDIA Robotics Fundamentals courses.

Acknowledgements

For their contributions to the research mentioned in this post, thanks to Ajay Mandlekar, Bohan Wang, Caelan Garrett, Chi-Pin Huang, Chuang Gan, Chunru Lin, Danfei Xu, Dieter Fox, Fu-En Yang, Haotian Yuan, Liqian Ma, Min-Hung Chen, Minghao Guo, Shuo Cheng, Tsun-Hsuan Wang, Xiaowen Qiu, Yashraj Narang, Yian Wang, Yu-Chiang Frank Wang, Yueh-Hua Wu, and Zhenyang Chen.


