The First Healthcare Robotics Dataset and Foundational Physical AI Models for Healthcare Robotics



Authors: Nigel Nelson, Lukas Zbinden, Mostafa Toloui, Sean Huver

Healthcare AI has mainly been perception-based, focused on models that interpret signals and classify or segment pathology and anatomy. But healthcare involves doing, and the static, perception-only datasets of the past, which lack embodiment, contact dynamics, and closed-loop control, are no longer sufficient. The field needs standardized robot bodies, synchronized vision-force-kinematics data, sim-to-real pairing, and cross-embodiment benchmarks to build the foundation for Physical AI.



1. Open-H-Embodiment

Open-H-Embodiment is a community-driven dataset initiative building the open, shared foundation needed to train and evaluate AI autonomy and world foundation models for surgical robotics and ultrasound. Started by a steering committee including Prof. Axel Krieger (Johns Hopkins), Prof. Nassir Navab (Technical University of Munich), and Dr. Mahdi Azizian (NVIDIA), the effort now spans 35 organizations.

Participants from around the world came together to build the first large-scale dataset to advance the cause of physical AI in healthcare robotics.

[Figure: Open-H-Embodiment sample data]



Participants

Balgrist, CMR Surgical, The Chinese University of Hong Kong, Great Bay University, Hong Kong Baptist University, Hamlyn, ImFusion, Johns Hopkins University, Leeds University, Mohamed bin Zayed University of Artificial Intelligence, Moon Surgical, NVIDIA, Northwell Health, Obuda University, The Hong Kong Polytechnic University, Qilu Hospital of Shandong University, Rob Surgical, Sanoscience, Surgical Data Science Collective, Semaphor Surgical, Stanford, Dresden University of Technology, Technical University of Munich, Tuodao, Turin, University of British Columbia, UC Berkeley, UC San Diego, University of Illinois Chicago, University of Tennessee, University of Texas, Vanderbilt, and Virtual Incision.



The Dataset

  • Comprises 778 hours of CC-BY-4.0 healthcare robotics training data, largely surgical robotics, but also ultrasound and colonoscopy autonomy data.
  • Spans simulation, benchtop exercises (e.g., suturing), and real clinical procedures.
  • Uses industrial robots (CMR Surgical, Rob Surgical, Tuodao) and research robots (dVRK, Franka, Kuka).
  • Released alongside two new, permissively licensed open-source models post-trained on this data.



2. GR00T-H: Vision-Language-Action Model for Surgical Robotics

First is GR00T-H, a derivative of the Isaac GR00T N series of Vision-Language-Action (VLA) models. Trained on roughly 600 hours of Open-H-Embodiment data, GR00T-H is the first policy model for surgical robotics tasks.

Building on NVIDIA's open-source ecosystem, Isaac GR00T-H leverages Cosmos Reason 2 2B as its Vision-Language Model (VLM) backbone.




Architectural Design Decisions

Surgical robotics requires high precision, but specialized hardware (like cable-driven systems) makes imitation learning (IL) difficult. To address this, GR00T-H uses four key design choices:

  • Unique Embodiment Projectors: A unique, learnable MLP maps each robot's specific kinematics to a shared, normalized action space (see the sketch after this list).
  • State Dropout (100%): Proprioceptive input is dropped during inference to create a learned bias term for each system, yielding better real-world results.
  • Relative EEF Actions: Training uses a common relative End-Effector (EEF) action space to overcome kinematic inconsistencies.
  • Metadata in Task Prompts: Instrument names and control index mapping are injected directly into the VLM task prompt.
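
To make the embodiment projector idea concrete, here is a minimal PyTorch sketch: one small MLP per robot maps its native action vector into a shared space that the policy backbone consumes. All dimensions, layer sizes, and robot names are illustrative assumptions, not the released model's actual configuration.

```python
import torch
import torch.nn as nn

class EmbodimentProjector(nn.Module):
    """One learnable MLP per robot: maps the robot's native action/kinematic
    vector into a shared, normalized action space."""

    def __init__(self, native_dim: int, shared_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(native_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, shared_dim),
        )

    def forward(self, native: torch.Tensor) -> torch.Tensor:
        return self.mlp(native)

# One projector per embodiment; the VLM-backed policy only ever sees the
# shared space. Native dimensions below are assumed for illustration.
projectors = nn.ModuleDict({
    "dvrk": EmbodimentProjector(native_dim=14, shared_dim=32),
    "franka": EmbodimentProjector(native_dim=8, shared_dim=32),
})

shared_actions = projectors["dvrk"](torch.randn(4, 14))  # shape (4, 32)
```

A nice property of this pattern is that supporting a new robot only requires training a small adapter against the shared space rather than retraining the whole policy.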

A prototype of GR00T-H has demonstrated the ability to execute a complete, end-to-end suture in the SutureBot benchmark, highlighting robust long-horizon dexterity.

[Figure: GR00T-H performing end-to-end suturing]




3. Cosmos-H-Surgical-Simulator

Cosmos-H-Surgical-Simulator is a World Foundation Model (WFM) for action-conditioned surgical robotics. Traditional simulators fall short on real-world complexities like soft tissue, reflections, blood, and smoke.



Key Capabilities

  • Overcoming the Sim-to-Real Gap: Fine-tuned from NVIDIA Cosmos Predict 2.5 2B, it generates physically plausible surgical video directly from kinematic actions (see the rollout sketch after this list).
  • Efficiency Gains: For 600 rollouts, it took only 40 minutes in simulation versus 2 days using real-world benchtop methods.
  • WFM as a Physics Simulator: Implicitly learns tissue deformation and tool interaction from data.
  • Synthetic Data Generation: Generates realistic synthetic video-action pairs to augment underrepresented datasets.
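
The loop below sketches how an action-conditioned WFM is typically driven: video chunks are predicted autoregressively, each conditioned on the most recent frames and the next slice of the kinematic action plan. The `ToyWorldModel` stub and all tensor shapes are assumptions for illustration; this is not the Cosmos Predict API.

```python
import torch

class ToyWorldModel:
    """Stand-in for an action-conditioned world model. A real WFM predicts
    video latents; this stub just shifts pixel values so the loop runs."""

    def predict(self, frames: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # frames: (chunk, C, H, W); actions: (chunk, action_dim)
        return frames + actions.mean() * 0.01

def rollout(model, init_frames: torch.Tensor, action_plan: torch.Tensor,
            chunk: int = 8) -> torch.Tensor:
    """Autoregressively unroll the model: each step conditions the next
    video chunk on the latest frames and the next action chunk."""
    video = [init_frames]
    for t in range(0, action_plan.shape[0], chunk):
        video.append(model.predict(video[-1], action_plan[t:t + chunk]))
    return torch.cat(video, dim=0)

# 44-dimensional actions, matching the fine-tuning details in the next
# section; all other sizes are illustrative.
frames = rollout(ToyWorldModel(),
                 init_frames=torch.zeros(8, 3, 64, 64),
                 action_plan=torch.randn(64, 44))
print(frames.shape)  # torch.Size([72, 3, 64, 64])
```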




Fine-Tuning Details

The model was fine-tuned on the Open-H-Embodiment dataset (9 robot embodiments, 32 datasets) using 64x A100 GPUs for about 10,000 GPU-hours. It utilizes a unified 44-dimensional action space.
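
One common way to realize such a unified action space is to scatter each robot's native action vector into fixed slots of a 44-dimensional vector, zero-padding the rest. The slot assignments below are hypothetical; the actual layout used for fine-tuning is not specified here.

```python
import numpy as np

UNIFIED_DIM = 44  # from the fine-tuning details above

# Hypothetical slot assignments: which unified dimensions each robot's
# native action vector occupies. The real layout is not published here.
SLOT_MAP = {
    "dvrk": list(range(0, 14)),    # e.g. two 7-DoF arms
    "franka": list(range(14, 22)), # e.g. 7 joints + gripper
}

def to_unified(embodiment: str, native: np.ndarray) -> np.ndarray:
    """Scatter a native action vector into the shared 44-dim space,
    leaving slots owned by other embodiments zeroed."""
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    unified[SLOT_MAP[embodiment]] = native
    return unified

print(to_unified("franka", np.ones(8)))  # zeros except slots 14..21
```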




4. What’s Next: Towards Reasoning For Surgical Robotics

The goal for version 2 of the Open-H-Embodiment effort is to move beyond perceptual control to reasoning-capable autonomy (a surgical robotics "ChatGPT moment") where systems can explain, plan, and adapt across long procedures. This requires extending Open-H-Embodiment into reasoning-ready data with annotated task traces capturing intents, outcomes, and failure modes. This effort needs community engagement, and we invite you to get involved. Visit our Open-H GitHub repo to help shape the future of healthcare robotics.
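
As a concrete illustration of what a reasoning-ready annotation might look like, here is a hypothetical Python record capturing the intents, outcomes, and failure modes named above; the actual version-2 schema has not been published.

```python
from dataclasses import dataclass, field

@dataclass
class TaskTrace:
    """Hypothetical reasoning-ready task trace; field names follow the
    annotation types named above, not any released specification."""
    task: str                        # e.g. "interrupted suture, throw 3"
    intent: str                      # what the system was trying to achieve
    outcome: str                     # "success", "partial", or "failure"
    failure_mode: str | None = None  # populated only when the step failed
    timestamps: list[float] = field(default_factory=list)  # video-aligned

trace = TaskTrace(
    task="interrupted suture, throw 3",
    intent="drive needle through both wound edges",
    outcome="failure",
    failure_mode="needle slipped from grasper",
)
print(trace.outcome, trace.failure_mode)
```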




5. Get started today

Access the following resources to start working with the Open-H-Embodiment dataset and models:


