The next generation of AI-driven robots, such as humanoids and autonomous vehicles, relies on high-fidelity, physics-aware training data. Without diverse and representative datasets, these systems are undertrained and risky to test, suffering from poor generalization, limited exposure to real-world variations, and unpredictable behavior in edge cases. Collecting massive real-world datasets for training is expensive, time-intensive, and often constrained by what can practically be captured.
NVIDIA Cosmos addresses this challenge by accelerating world foundation model (WFM) development. At the core of the platform, Cosmos WFMs speed up synthetic data generation and serve as a foundation for post-training, producing downstream domain- or task-specific physical AI models that solve these challenges. This post explores the latest Cosmos WFMs, the key capabilities that advance physical AI, and how to use them.
Cosmos world foundation model updates:
NVIDIA Cosmos world foundation models have continued to evolve rapidly, with significant advancements that further speed up synthetic data generation and physical AI development. One year after their introduction, key updates include:
- Cosmos Transfer 2.5—Faster and more scalable data augmentation from simulation and 3D spatial inputs, enabling greater diversity across environments, lighting conditions, and scene variations.
- Cosmos Predict 2.5—Enhanced long-tail scenario generation for sequences up to 30 seconds, delivering up to 10x higher accuracy when post-trained on proprietary or domain-specific data. Supports multiview outputs, custom camera layouts, and alternate policy outputs such as action simulation.
- Cosmos Reason 2—Advanced physical AI reasoning with improved spatiotemporal understanding and timestamp precision. Adds object detection with 2D/3D point localization and bounding box coordinates, along with reasoning explanations and labels. Expanded long-context support up to 256K input tokens.
Cosmos Transfer for photorealistic videos grounded in physics
Cosmos Transfer generates high-fidelity world scenes from structural inputs, ensuring precise spatial alignment and scene composition.
Employing the ControlNet architecture, Cosmos Transfer preserves pretrained knowledge, enabling structured, consistent outputs. It utilizes spatiotemporal control maps to dynamically align synthetic and real-world representations, enabling fine-grained control over scene composition, object placement, and motion dynamics.
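The spatiotemporal control-map idea can be illustrated with a toy example. This is a minimal sketch in NumPy, not the actual Cosmos API: a per-pixel, per-frame weight map decides how strongly each conditioning branch (here, stand-ins for depth and segmentation features) influences each region of the output.

```python
import numpy as np

# Toy spatiotemporal control map (illustrative only, not the Cosmos API):
# two conditioning branches are blended per pixel and per frame by a
# weight map with values in [0, 1].
T, H, W = 4, 8, 8                      # frames, height, width
depth_feat = np.random.rand(T, H, W)   # stand-in for a depth control branch
seg_feat = np.random.rand(T, H, W)     # stand-in for a segmentation branch

# Control map: emphasize depth in the top half of each frame and
# segmentation in the bottom half, held constant over time here.
control = np.zeros((T, H, W))
control[:, : H // 2, :] = 0.8          # 80% depth weight in the top half
control[:, H // 2 :, :] = 0.2          # 20% depth weight in the bottom half

# Per-element blend of the two conditioning signals.
blended = control * depth_feat + (1.0 - control) * seg_feat
print(blended.shape)  # (4, 8, 8): one blended conditioning signal per frame
```

Because the map varies over space (and could vary over time), it gives fine-grained control over which input modality dominates which part of the scene.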
Inputs:
- Structured visual or geometric data: segmentation maps, depth maps, edge maps, human motion keypoints, LiDAR scans, trajectories, HD maps, and 3D bounding boxes.
- Ground truth annotations: high-fidelity references for precise alignment.
Output: Photorealistic video sequences with controlled layout, object placement, and motion.




Figure 1. On the left, a virtual simulation or ‘ground truth’ created in NVIDIA Omniverse. On the right, the photoreal transformation using Cosmos Transfer
Key capabilities:
- Generate scalable, photorealistic synthetic data that aligns with real-world physics.
- Control object interactions and scene composition through structured multimodal inputs.
Using Cosmos Transfer for controllable synthetic data
With generative AI APIs and SDKs, NVIDIA Omniverse accelerates physical AI simulation. Developers use NVIDIA Omniverse, built on OpenUSD, to create 3D scenes that accurately simulate real-world environments for training and testing robots and autonomous vehicles. These simulations serve as ground truth video inputs for Cosmos Transfer, combined with annotations and text instructions. Cosmos Transfer enhances photorealism while varying environment, lighting, and visual conditions to generate scalable, diverse world states.
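The "vary environment, lighting, and visual conditions" step can be sketched as simple prompt augmentation over one fixed ground-truth clip. The scene text and variation lists below are invented for illustration; the actual Cosmos Transfer prompt format may differ.

```python
from itertools import product

# Illustrative prompt augmentation for one Omniverse ground-truth clip:
# combine a base scene description with environment variations to request
# many photoreal renderings of the same underlying geometry and motion.
base = "A warehouse robot picks a box from a shelf"
lighting = ["at noon under bright skylights", "at night under dim fluorescents"]
weather = ["", "with fog drifting through the aisle"]   # "" = clear conditions
flooring = ["over clean floors", "over scuffed concrete floors"]

prompts = [
    " ".join(part for part in (base, light, fog, floor) if part)
    for light, fog, floor in product(lighting, weather, flooring)
]
print(len(prompts))  # 2 * 2 * 2 = 8 prompt variations for a single clip
```

Each prompt pairs with the same ground-truth video, so one simulated scene yields many visually distinct but spatially identical training samples.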
This workflow accelerates the creation of high-quality training datasets, ensuring AI agents generalize effectively from simulation to real-world deployment.




Cosmos Transfer enhances robotics development by enabling realistic lighting, colors, and textures in the Isaac GR00T Blueprint for synthetic manipulation motion generation, and by varying environmental and weather conditions in the Omniverse Blueprint for Autonomous Vehicle Simulation. This photorealistic data is crucial for post-training policy models, ensuring smooth simulation-to-reality transfer and supporting model training for perception AI and specialized robot models like GR00T N1.
How to run the new Cosmos Transfer 2.5:
Cosmos Predict for generating future world states
Cosmos Predict WFM is designed to model future world states as video from multimodal inputs, including text, video, and start-end frame sequences. It’s built using transformer-based architectures that enhance temporal consistency and frame interpolation.
Key capabilities:
- Generates realistic world states directly from text prompts.
- Predicts next states from video sequences by filling in missing frames or extending motion.
- Multiframe generation between a starting and ending image, creating a complete, smooth sequence.
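Multiframe generation between a start and end image is, conceptually, learned interpolation. A naive linear crossfade, sketched below, shows only the in-between structure; the actual model learns physically plausible, nonlinear in-betweens rather than a pixel blend.

```python
import numpy as np

# Naive linear interpolation between a start and end frame, as a
# conceptual stand-in for multiframe generation (the real model produces
# physically plausible in-betweens, not a crossfade).
start = np.zeros((4, 4))    # stand-in start frame (all-black)
end = np.ones((4, 4))       # stand-in end frame (all-white)
n_between = 3               # number of frames to generate in between

# Blend weights including both endpoints: 0.0, 0.25, 0.5, 0.75, 1.0
alphas = np.linspace(0.0, 1.0, n_between + 2)
sequence = [(1.0 - a) * start + a * end for a in alphas]

print(len(sequence))        # 5 frames: start, 3 in-betweens, end
print(sequence[2][0, 0])    # middle frame pixel value: 0.5
```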
Cosmos Predict WFM provides a strong foundation for training downstream world models in robotics and autonomous vehicles. You can post-train these models to generate actions instead of video for policy modeling, or adapt them for visual-language understanding to create custom perception AI models.
How to run the new Cosmos Predict 2.5:
Cosmos Reason to perceive, reason, and respond intelligently
Cosmos Reason is a fully customizable multimodal AI reasoning model, purpose-built to understand motion, object interactions, and space-time relationships. Using chain-of-thought (CoT) reasoning, the model interprets visual input, predicts outcomes based on the given prompt, and rewards the optimal decision. Unlike text-based LLMs, it grounds reasoning in real-world physics, generating clear, context-aware responses in natural language.
Input: Video observations and a text-based query or instruction.
Output: Text response generated through long-horizon CoT reasoning.
Key capabilities:
- Understands how objects move, interact, and change over time.
- Predicts and rewards the next best action based on the input observation.
- Continuously refines decision-making.
- Purpose-built for post-training to build perception AI and embodied AI models.
Training pipeline
Cosmos Reason is trained in three stages, enhancing its ability to reason, predict, and respond in real-world scenarios.
- Pretraining: Uses a Vision Transformer (ViT) to process video frames into structured embeddings, aligning them with text for a shared understanding of objects, actions, and spatial relationships.
- Supervised fine-tuning (SFT): Specializes the model in physical reasoning across two key levels. General fine-tuning enhances language grounding and multimodal perception using diverse video-text datasets, while further training on physical AI data sharpens the model's ability to reason about real-world interactions. The model learns object behaviors (how objects can be used in the real world), action sequences (how multi-step tasks unfold), and spatial feasibility (distinguishing realistic from impossible placements).
- Reinforcement learning (RL): The model evaluates different reasoning paths and updates itself only when a better decision emerges through trial and reward feedback. Instead of relying on human-labeled data, it uses rule-based rewards:
- Entity recognition: Rewarding accurate identification of objects and their properties.
- Spatial constraints: Penalizing physically impossible placements while reinforcing realistic object positioning.
- Temporal reasoning: Encouraging correct sequence prediction based on cause-effect relationships.
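The three rule-based reward signals above can be sketched as simple check functions whose sum scores a candidate reasoning outcome. This is a toy illustration with hypothetical field names, not the actual Cosmos Reason training code.

```python
# Toy rule-based rewards mirroring the three signals above.
# All field names ("objects", "object_heights", "event_order") are
# hypothetical; real training operates on model outputs.

def entity_reward(pred: dict, truth: dict) -> float:
    """Reward accurate identification of objects in the scene."""
    hits = len(set(pred["objects"]) & set(truth["objects"]))
    return hits / max(len(truth["objects"]), 1)

def spatial_reward(pred: dict) -> float:
    """Penalize physically impossible placements (e.g., below the floor)."""
    return 1.0 if all(z >= 0 for z in pred["object_heights"]) else -1.0

def temporal_reward(pred: dict, truth: dict) -> float:
    """Encourage correct cause-effect ordering of predicted events."""
    return 1.0 if pred["event_order"] == truth["event_order"] else 0.0

truth = {"objects": ["cup", "table"], "event_order": ["reach", "grasp", "lift"]}
pred = {
    "objects": ["cup", "table"],
    "object_heights": [0.9, 0.75],            # meters above the floor
    "event_order": ["reach", "grasp", "lift"],
}
total = entity_reward(pred, truth) + spatial_reward(pred) + temporal_reward(pred, truth)
print(total)  # 3.0: perfect entity, spatial, and temporal scores
```

Because each reward is a deterministic rule rather than a human label, scoring scales to large volumes of rollouts during RL.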
How to run the new Cosmos Reason 2:
Get started
Updated on March 13, 2026, with advancements to NVIDIA Cosmos world foundation models.
