We’re continually expanding NVIDIA Cosmos™ world foundation models (WFMs) to tackle some of the hardest problems in robotics, autonomous vehicle development, and industrial vision AI.
To further support this effort, we’re introducing Cosmos Policy, our latest research on advancing robot control and planning using Cosmos WFMs.
TL;DR
- Cosmos Policy: A new state-of-the-art robot control policy that post-trains the Cosmos Predict-2 world foundation model for manipulation tasks. It directly encodes robot actions and future states into the model, achieving SOTA performance on the LIBERO and RoboCasa benchmarks.
- Cosmos Cookoff: An open hackathon where developers can get hands-on with Cosmos world foundation models and push the boundaries of physical AI.
Overview: What Is Cosmos Policy?
Cosmos Policy is a robot control and planning policy obtained by fine-tuning Cosmos Predict, a world foundation model trained to predict future frames. Instead of introducing new architectural components or separate motion modules, Cosmos Policy adapts the pretrained model directly through a single stage of post-training on robot demonstration data.
A policy is the system’s decision-making brain that maps observations (such as camera images) to physical actions (like moving a robotic arm) to complete tasks.
What’s different?
The key innovation of Cosmos Policy is how it represents this data. Instead of building separate neural networks for the robot’s perception and control, it treats robot actions, physical states, and success scores just like frames in a video.
All of these are encoded as additional latent frames and learned with the same diffusion process as video generation, allowing the model to inherit its pre-learned understanding of physics, gravity, and how scenes evolve over time.
Latent refers to the compressed, mathematical representation a model uses to understand data internally (rather than raw pixels).
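To make the idea concrete, here is a minimal, illustrative sketch (in PyTorch) of how low-dimensional signals such as an action chunk, the robot state, and a value estimate could be packed as extra latent frames alongside video latents. All names, shapes, and the zero-padding projection below are hypothetical stand-ins for exposition, not the actual Cosmos Policy implementation.

```python
# Hypothetical sketch: pack actions, states, and a value estimate as extra
# "latent frames" next to video latents so one diffusion model covers them all.
import torch

B, T, C, H, W = 1, 4, 16, 32, 32   # batch, video latent frames, latent channels, spatial dims (assumed)
D = C * H * W                       # flattened size of one latent frame

video_latents = torch.randn(B, T, C, H, W)   # encoded camera observations
action_chunk  = torch.randn(B, 8, 7)         # e.g., 8 future actions, 7-DoF each (assumed)
robot_state   = torch.randn(B, 14)           # e.g., joint positions + gripper (assumed)
value_est     = torch.randn(B, 1)            # expected return / success score

def to_latent_frame(x: torch.Tensor) -> torch.Tensor:
    """Zero-pad a low-dimensional signal to the size of one latent frame.
    (A learned projection would be used in practice; padding keeps the sketch simple.)"""
    flat = x.flatten(1)
    flat = torch.nn.functional.pad(flat, (0, D - flat.shape[1]))
    return flat.view(B, 1, C, H, W)

# Concatenate along the temporal axis: the denoiser sees one unified sequence,
# so actions, states, and values are modeled by the same video diffusion process.
sequence = torch.cat(
    [video_latents,
     to_latent_frame(action_chunk),
     to_latent_frame(robot_state),
     to_latent_frame(value_est)],
    dim=1,
)  # shape: (B, T + 3, C, H, W)
```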
As a result, a single model can:
- Predict action chunks to guide robot movement using hand-eye coordination (i.e., visuomotor control)
- Predict future robot observations for world modeling
- Predict expected returns (i.e., value function) for planning
All three capabilities are learned jointly inside one unified model.
Cosmos Policy can be deployed either as a direct policy, where only actions are generated at inference time, or as a planning policy, where multiple candidate actions are evaluated by predicting their resulting future states and values.
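The sketch below illustrates the two deployment modes against a purely hypothetical model interface (`sample_actions`, `predict_future`, `predict_value`). It is not the actual Cosmos Policy API; it only shows how planning reuses the world-model and value predictions that the unified model already produces.

```python
# Hypothetical deployment sketch; the `model` interface is assumed, not the real API.

def act_direct(model, observation):
    """Direct policy: generate a single action chunk and execute it."""
    return model.sample_actions(observation, num_samples=1)[0]

def act_with_planning(model, observation, num_candidates: int = 8):
    """Planning policy: sample several candidate action chunks, imagine the
    future each one leads to, score it with the predicted value, keep the best."""
    candidates = model.sample_actions(observation, num_samples=num_candidates)
    best_actions, best_value = None, float("-inf")
    for actions in candidates:
        imagined_future = model.predict_future(observation, actions)  # world modeling
        value = model.predict_value(imagined_future)                  # expected return
        if value > best_value:
            best_actions, best_value = actions, value
    return best_actions
```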
Base model: Cosmos Predict and why it matters
Recent work in robotic manipulation has increasingly relied on large pretrained backbones to enhance generalization and data efficiency. Most of these approaches build on vision-language models (VLMs) trained on large-scale image–text datasets and fine-tuned to predict robot actions.
These models learn to understand videos and describe what they see, but they don’t learn how to physically perform actions. A VLM can suggest high-level actions like “turn left” or “pick up the purple cup,” but it doesn’t know how to carry them out precisely.
In contrast, WFMs are trained to predict how scenes evolve over time and to generate temporal dynamics in video. These capabilities are directly relevant to robot control, where actions must account for how the environment and the robot’s own state change over time.
Cosmos Predict is trained for physical AI using a diffusion objective over continuous spatiotemporal latents, enabling it to model complex, high-dimensional, and multimodal distributions across long temporal horizons.
This design makes Cosmos Predict a natural foundation for visuomotor control:
- The model already learns state transitions through future-frame prediction.
- Its diffusion formulation supports multimodal outputs, which is critical for tasks with multiple valid motion sequences.
- The transformer-based denoiser can scale to long sequences and multiple modalities.
Cosmos Policy post-trains Cosmos Predict-2 to generate robot actions alongside future observations and value estimates, using the model’s native diffusion process. This enables the policy to fully inherit the pretrained model’s understanding of temporal structure and physical interaction while remaining simple to train and deploy.
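As a rough illustration of what “using the model’s native diffusion process” means, the sketch below applies a generic denoising loss to the joint latent sequence (video frames plus action/state/value latents). The noise schedule, parameterization, and conditioning here are simplified assumptions for exposition, not the actual Cosmos Predict-2 training objective.

```python
# Simplified, assumed denoising objective; `denoiser` is a stand-in for the
# transformer-based model, and the noise schedule is illustrative only.
import torch

def diffusion_loss(denoiser, latent_sequence, conditioning):
    """One training step: corrupt the joint latent sequence (video frames plus
    action/state/value latents) with noise and train the model to predict that noise."""
    noise = torch.randn_like(latent_sequence)
    t = torch.rand(latent_sequence.shape[0], device=latent_sequence.device)  # noise level per sample
    sigma = t.view(-1, *([1] * (latent_sequence.dim() - 1)))                  # broadcast over frame dims
    noisy = latent_sequence + sigma * noise                                   # simple additive corruption
    pred_noise = denoiser(noisy, sigma, conditioning)
    return torch.nn.functional.mse_loss(pred_noise, noise)
```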
⚡ Important update: The latest Cosmos Predict 2.5 is here. Check out the model card.
Results at a Glance
Cosmos Policy is evaluated across simulation benchmarks and real-world robot manipulation tasks, comparing against diffusion-based policies trained from scratch, video-based robot policies, and fine-tuned vision-language-action (VLA) models.
Cosmos Policy is evaluated on LIBERO and RoboCasa, two standard benchmarks for multi-task and long-horizon robotic manipulation.
On LIBERO, Cosmos Policy consistently outperforms prior diffusion policies and VLA-based approaches across task suites, particularly on tasks that require precise temporal coordination and multi-step execution.
| Model | Spatial SR (%) | Object SR (%) | Goal SR (%) | Long SR (%) | Average SR (%) |
|---|---|---|---|---|---|
| Diffusion Policy | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| Dita | 97.4 | 94.8 | 93.2 | 83.6 | 92.3 |
| π0 | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| UVA | — | — | — | 90.0 | — |
| UniVLA | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| Video Policy | — | — | — | 94.0 | — |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| CogVLA | 98.6 | 98.8 | 96.6 | 95.4 | 97.4 |
| Cosmos Policy (ours) | 98.1 | 100.0 | 98.2 | 97.6 | 98.5 |
On RoboCasa, Cosmos Policy achieves higher success rates than baselines trained from scratch, demonstrating improved generalization across diverse household manipulation scenarios.
| Model | # Training Demos per Task | Average SR (%) |
|---|---|---|
| GR00T-N1 | 300 | 49.6 |
| UVA | 50 | 50.0 |
| DP-VLA | 3000 | 57.3 |
| GR00T-N1 + DreamGen | 300 (+10000 synthetic) | 57.6 |
| GR00T-N1 + DUST | 300 | 58.5 |
| UWM | 1000 | 60.8 |
| π0 | 300 | 62.5 |
| GR00T-N1.5 | 300 | 64.1 |
| Video Policy | 300 | 66.0 |
| FLARE | 300 | 66.4 |
| GR00T-N1.5 + HAMLET | 300 | 66.4 |
| Cosmos Policy (ours) | 50 | 67.1 |
In both benchmarks, initializing from Cosmos Predict provides a significant performance advantage over training equivalent architectures without video pretraining.
Planning vs. Direct Policy Execution
When deployed as a direct policy, Cosmos Policy already matches or exceeds state-of-the-art performance on most tasks.
When enhanced with model-based planning, we observe a 12.5% higher task completion rate on average in two difficult real-world manipulation tasks.
Real-World Manipulation
Cosmos Policy is also evaluated on real-world bimanual manipulation tasks using the ALOHA robot platform.
The policy successfully executes long-horizon manipulation tasks directly from visual observations.
Learn more about the architecture and results here.
What’s Next: Cosmos Cookoff
Cosmos Policy is fresh out of research, and this work represents an early step toward adapting world foundation models for robot control and planning. We’re actively working with early adopters to evolve this research for our robotics community.
In parallel, Cosmos Policy is already available to developers through a practical Cosmos Cookbook recipe, which demonstrates how you can adopt and build on it.
To support hands-on experimentation with Cosmos WFMs, we’re announcing the Cosmos Cookoff, an open hackathon focused on building applications and workflows using Cosmos models and cookbook recipes. The latest Cookoff is live, inviting physical AI developers across robotics, autonomous vehicles, and video analytics to explore, prototype quickly, and learn with experts.
🍳 Join the Cosmos Cookoff
- 📅 When: Jan 29 – Feb 26
- 👥 Team Format: Up to 4 members per team
- 🏆 Prizes: $5,000 cash prize, NVIDIA DGX Spark™, NVIDIA GeForce RTX™ 5090 GPU, and more!
- 🧑⚖️ Judges: Projects will be reviewed by experts from Datature, Hugging Face, Nebius, Nexar, and NVIDIA, bringing deep experience in open models, cloud/compute, and real-world edge and vision AI deployments.
