NVIDIA Cosmos Reason 2 Brings Advanced Reasoning To Physical AI

By Tsung-Yi Lin and Debraj Sinha


NVIDIA today released Cosmos Reason 2, the latest advancement in open, reasoning vision language models for physical AI. Cosmos Reason 2 surpasses its previous version in accuracy and tops the Physical AI Bench and Physical Reasoning leaderboards as the #1 open model for visual understanding.



NVIDIA Cosmos Reason 2: Reasoning Vision Language Model for Physical AI

Since their introduction, vision-language models have rapidly improved at tasks like object and pattern recognition in images. But they still struggle with tasks humans find natural, like planning several steps ahead, coping with uncertainty, or adapting to new situations. Cosmos Reason is designed to close this gap by giving robots and AI agents stronger common sense and reasoning to solve complex problems step by step.

Cosmos Reason 2 is a state-of-the-art, open reasoning vision-language model (VLM) that enables robots and AI agents to see, understand, plan, and act in the physical world like humans. It uses common sense, physics, and prior knowledge to understand how objects move across space and time, so it can handle complex tasks, adapt to new situations, and figure out how to solve problems step by step.



✨ Key Highlights

  • Improved spatio-temporal understanding and timestamp precision.

  • Optimized performance with flexible deployment options from edge to cloud, in 2B and 8B parameter model sizes.

  • Support for an expanded set of spatial understanding and visual perception capabilities: 2D/3D point localization, bounding box coordinates, trajectory data, and OCR.

  • Improved long-context understanding with 256K input tokens, up from 16K with Cosmos Reason 1.

  • Adaptable to multiple use cases with easy-to-use Cosmos Cookbook recipes.



🤖 Popular Use Cases

  • Video analytics AI agents: These agents can extract useful insights from massive volumes of video data to optimize processes. Cosmos Reason 2 builds on the capabilities of Cosmos Reason 1 and now provides OCR support, in addition to 2D/3D point localization and set-of-mark understanding.

    Example of how Cosmos Reason can understand text embedded in a video to determine the condition of the road during a rainstorm.

    Developers can jumpstart development of video analytics AI agents by using the NVIDIA blueprint for video search and summarization (VSS) with Cosmos Reason as the VLM.

    Salesforce is transforming workplace safety and compliance by analyzing video footage captured by Cobalt robots with Agentforce and the VSS blueprint, with Cosmos Reason as the VLM.

  • Data annotation and critique: Enable developers to automate high-quality annotation and critique of massive, diverse training datasets. Cosmos Reason provides timestamps and detailed descriptions for real or synthetically generated training videos.

    Data annotation and critique example
    Example of a sample prompt to generate detailed, time-stamped captions for a race car video.

    Uber is exploring Cosmos Reason 2 to deliver accurate, searchable video captions for autonomous vehicle (AV) training data, enabling efficient identification of critical driving scenarios. The co-authored recipe, Reason 2 for AV Video Captioning and VQA, demonstrates how to fine-tune and evaluate Cosmos Reason 2-8B on annotated AV videos. Across multiple evaluation metrics, measurable improvements were achieved: BLEU scores improved 10.6% (0.113 → 0.125), MCQ-based VQA gained 0.67 percentage points (80.18% → 80.85%), and LingoQA increased 13.8 percentage points (63.2% → 77.0%). These gains demonstrate effective domain adaptation for AV applications.

  • Robot planning and reasoning: Act as the brain for deliberate, methodical decision-making in a robot vision language action (VLA) model. Cosmos Reason 2 now provides trajectory coordinates in addition to determining next steps.

    Example of the prompt and JSON output from Cosmos Reason 2 that provide the steps and trajectory the robot gripper must take to move the painter’s tape into the basket.

    Encord provides native support for Cosmos Reason 2 in its Data Agent library and AI data platform, enabling developers to leverage Cosmos Reason 2 as a VLA for robotics and other physical AI use cases.
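The fine-tuning results quoted for the AV captioning recipe mix two kinds of change: a relative improvement (a percentage of the starting score) and an absolute one (a difference in percentage points). The short sketch below recomputes both from the before and after scores reported in the text. The function names are illustrative and not part of any Cosmos tooling.

```python
def relative_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return (after - before) / before * 100.0


def absolute_change(before: float, after: float) -> float:
    """Plain difference; for scores already in percent this is percentage points."""
    return after - before


# (before, after) pairs as reported for the AV captioning and VQA recipe.
reported = {
    "BLEU": (0.113, 0.125),
    "MCQ-based VQA (%)": (80.18, 80.85),
    "LingoQA (%)": (63.2, 77.0),
}

for metric, (before, after) in reported.items():
    print(
        f"{metric}: {before} -> {after} | "
        f"absolute {absolute_change(before, after):+.3f} | "
        f"relative {relative_change(before, after):+.1f}%"
    )
```

Run on these numbers, the BLEU figure of 10.6% is a relative gain, while the MCQ-based VQA gain of 0.67 and the LingoQA gain of 13.8 are absolute differences in percentage points. Expressed as a relative change, the LingoQA improvement is about 21.8%.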

Companies like Hitachi, Milestone and VAST Data are using Cosmos Reason to advance robotics, autonomous driving, and video analytics AI agents for traffic and workplace safety.

Try Cosmos Reason 2 on build.nvidia.com and experience the latest features with sample prompts for generating bounding boxes and robot trajectories. Upload your own videos and images for further evaluation.

Download Cosmos Reason 2 models (2B and 8B) on Hugging Face or use Cosmos Reason 2 in the cloud. The model will be available soon on Amazon Web Services, Google Cloud and Microsoft Azure. To get started, see the Cosmos Reason 2 documentation and the Cosmos Cookbook.
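As a rough sketch of the Hugging Face download step, the snippet below uses `snapshot_download` from the `huggingface_hub` package. The repository names (`nvidia/Cosmos-Reason2-2B` and `nvidia/Cosmos-Reason2-8B`) and the local directory layout are assumptions made for illustration; confirm the exact identifiers on the model pages before relying on them.

```python
from pathlib import Path

# ASSUMPTION: repository naming pattern; verify on the Hugging Face model pages.
REPO_TEMPLATE = "nvidia/Cosmos-Reason2-{size}"
SIZES = ("2B", "8B")  # the two sizes named in the article


def repo_id_for(size: str) -> str:
    """Map a model size such as '2B' to the assumed Hugging Face repo id."""
    normalized = size.upper()
    if normalized not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    return REPO_TEMPLATE.format(size=normalized)


def download(size: str, target_dir: str = "checkpoints") -> Path:
    """Download all files of the chosen checkpoint and return the local path."""
    # Imported here so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    # Hypothetical local layout, e.g. checkpoints/cosmos-reason2-2b
    local_dir = Path(target_dir) / f"cosmos-reason2-{size.lower()}"
    path = snapshot_download(repo_id=repo_id_for(size), local_dir=str(local_dir))
    return Path(path)
```

Calling `download("2B")` would fetch the smaller checkpoint into `checkpoints/cosmos-reason2-2b`. Depending on how the repositories are configured, authentication with a Hugging Face token may be required first.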



Other Models From The Cosmos Family:



🔮 Cosmos Predict 2.5

Cosmos Predict is a generative AI model that predicts future states of the physical world as video, based on text, image, or video inputs.

  • Physical AI Bench leader for quality, accuracy and overall consistency.
  • Up to 30 seconds of physically and temporally consistent video per generation.
  • Supports multiple framerates and backbone.
  • Pre-trained on 200 million clips.
  • Available as 2B and 14B pre-trained models and various 2B post-trained models for multiview, motion conditioning and autonomous vehicle training.

Check out the model card >>

🔁 Cosmos Transfer 2.5

Cosmos Transfer is our lightest multicontrol model built for video to world style transfer.

  • Scale a single simulation or spatial video across various environments and lighting conditions.
  • Improved prompt adherence and physics alignment.
  • Use with NVIDIA Isaac Sim™ or NVIDIA Omniverse NuRec for simulation to real transformation.

Check out the model card >>

🤖 NVIDIA GR00T N1.6

NVIDIA GR00T N1.6 is an open reasoning vision language action (VLA) model, purpose-built for humanoid robots, that unlocks full body control and uses NVIDIA Cosmos Reason for better reasoning and contextual understanding.



Resources

🧑🏻‍🍳 Read the Cosmos Cookbook → https://nvda.ws/4qevli8

📚 Explore Models & Datasets → https://github.com/nvidia-cosmos

⬇️ Try Cosmos Models in our Hosted Catalog → https://nvda.ws/3Yg0Dcx

💻 Join the Cosmos Community → https://discord.gg/u23rXTHSC9

🗳️ Contribute to the Cosmos Cookbook → https://nvda.ws/4aQcBkk


