Robot, know thyself: New vision-based system teaches machines to understand their bodies


In an office at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), a soft robotic hand carefully curls its fingers to grasp a small object. The intriguing part isn’t the mechanical design or embedded sensors; in fact, the hand contains none. Instead, the entire system relies on a single camera that watches the robot’s movements and uses that visual data to control it.

This capability comes from a new system CSAIL scientists developed, offering a different perspective on robotic control. Rather than relying on hand-designed models or complex sensor arrays, it allows robots to learn how their bodies respond to control commands, solely through vision. The approach, called Neural Jacobian Fields (NJF), gives robots a kind of bodily self-awareness. An open-access paper about the work was published on June 25.

“This work points to a shift from programming robots to teaching robots,” says Sizhe Lester Li, MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and lead researcher on the work. “Today, many robotics tasks require extensive engineering and coding. In the future, we envision showing a robot what to do, and letting it learn how to achieve the goal autonomously.”

The motivation stems from a simple but powerful reframing: The main barrier to affordable, flexible robotics is not hardware, but control of capability, which could be achieved in multiple ways. Traditional robots are built to be rigid and sensor-rich, making it easier to construct a digital twin, a precise mathematical replica used for control. But when a robot is soft, deformable, or irregularly shaped, those assumptions collapse. Rather than forcing robots to match our models, NJF flips the script, giving robots the ability to learn their own internal model from observation.

Look and learn

This decoupling of modeling and hardware design could significantly expand the design space for robotics. In soft and bio-inspired robots, designers often embed sensors or reinforce parts of the structure simply to make modeling feasible. NJF lifts that constraint. The system doesn’t need onboard sensors or design tweaks to make control possible. Designers are freer to explore unconventional, unconstrained morphologies without worrying about whether they’ll be able to model or control them later.

“Think about how you learn to control your fingers: you wiggle, you observe, you adapt,” says Li. “That’s what our system does. It experiments with random actions and figures out which controls move which parts of the robot.”

The system has proven robust across a range of robot types. The team tested NJF on a pneumatic soft robotic hand capable of pinching and grasping, a rigid Allegro hand, a 3D-printed robotic arm, and even a rotating platform with no embedded sensors. In every case, the system learned both the robot’s shape and how it responded to control signals, just from vision and random motion.

The researchers see potential far beyond the lab. Robots equipped with NJF could one day perform agricultural tasks with centimeter-level localization accuracy, operate on construction sites without elaborate sensor arrays, or navigate dynamic environments where traditional methods break down.

At the core of NJF is a neural network that captures two intertwined aspects of a robot’s embodiment: its three-dimensional geometry and its sensitivity to control inputs. The system builds on neural radiance fields (NeRF), a technique that reconstructs 3D scenes from images by mapping spatial coordinates to color and density values. NJF extends this approach by learning not only the robot’s shape, but also a Jacobian field, a function that predicts how any point on the robot’s body moves in response to motor commands.
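The sketch below is a minimal illustration of that idea, not the authors’ implementation: a small network maps a 3D point to NeRF-style density and color plus a per-point Jacobian, whose columns describe how that point moves per unit change of each motor command. The class name, layer sizes, and the num_actuators parameter are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): a field network mapping a 3D point to
# NeRF-style outputs plus a per-point Jacobian relating motor commands to motion.
import torch
import torch.nn as nn

class NeuralJacobianField(nn.Module):
    def __init__(self, num_actuators: int, hidden: int = 256):
        super().__init__()
        self.num_actuators = num_actuators
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)                    # geometry, as in NeRF
        self.color_head = nn.Linear(hidden, 3)                      # radiance, as in NeRF
        self.jacobian_head = nn.Linear(hidden, 3 * num_actuators)   # d(point) / d(command)

    def forward(self, points: torch.Tensor):
        """points: (N, 3) spatial coordinates."""
        h = self.backbone(points)
        density = self.density_head(h)
        color = torch.sigmoid(self.color_head(h))
        jacobian = self.jacobian_head(h).view(-1, 3, self.num_actuators)
        return density, color, jacobian

    def predict_motion(self, points: torch.Tensor, delta_command: torch.Tensor):
        """Predict how each point moves for a small change in motor commands."""
        _, _, J = self.forward(points)
        return torch.einsum("nij,j->ni", J, delta_command)  # (N, 3) displacements
```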

To train the model, the robot performs random motions while multiple cameras record the results. No human supervision or prior knowledge of the robot’s structure is required; the system simply infers the relationship between control signals and motion by watching.
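As a rough sketch of how such self-supervised training could look, using the NeuralJacobianField sketched above and assuming observed point displacements have been extracted from the multi-camera recordings (for example, via point tracking), one gradient step might simply compare predicted and observed motion:

```python
# Minimal training-step sketch, assuming (points, delta_command, observed_disp)
# triples derived from the multi-camera recordings; names are illustrative.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, points, delta_command, observed_disp):
    """One gradient step: make predicted point motion match observed motion."""
    predicted_disp = model.predict_motion(points, delta_command)  # (N, 3)
    loss = F.mse_loss(predicted_disp, observed_disp)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```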

Once training is complete, the robot needs only a single monocular camera for real-time closed-loop control, running at about 12 hertz. This allows it to continuously observe itself, plan, and act responsively. That speed makes NJF more viable than many physics-based simulators for soft robots, which are often too computationally intensive for real-time use.
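A closed-loop controller built on the learned field might look like the following sketch, which solves a damped least-squares problem for the command update that moves tracked points toward their targets. The helpers observe_points() and send_command() are hypothetical stand-ins for the real perception and actuation interfaces, and the damping value is an assumption.

```python
# Closed-loop control sketch using the learned Jacobian field; observe_points()
# and send_command() are hypothetical placeholders, not a real API.
import torch

def control_step(model, current_points, target_points, damping=1e-3):
    """Damped least-squares solve for the command update that reduces point error."""
    with torch.no_grad():
        _, _, J = model(current_points)                           # (N, 3, A)
        J = J.reshape(-1, J.shape[-1])                            # stack points: (3N, A)
        error = (target_points - current_points).reshape(-1, 1)   # (3N, 1)
        A = J.T @ J + damping * torch.eye(J.shape[1])
        delta_command = torch.linalg.solve(A, J.T @ error).squeeze(-1)
    return delta_command

# Run at roughly camera rate (about 12 Hz in the reported setup):
# while not done:
#     points = observe_points()   # tracked points from the single monocular camera
#     send_command(control_step(model, points, target_points))
```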

In early simulations, even simple 2D fingers and sliders were able to learn this mapping from just a few examples. By modeling how specific points deform or shift in response to motion, NJF builds a dense map of controllability. That internal model allows it to generalize motion across the robot’s body, even when the data are noisy or incomplete.

“What’s really interesting is that the system figures out on its own which motors control which parts of the robot,” says Li. “That’s not programmed; it emerges naturally through learning, much like a person discovering the buttons on a new device.”

The future is soft

For decades, robotics has favored rigid, easily modeled machines, like the industrial arms found in factories, because their properties simplify control. But the field has been moving toward soft, bio-inspired robots that can adapt to the real world more fluidly. The trade-off? These robots are harder to model.

“Robotics today often feels out of reach because of costly sensors and complicated programming. Our goal with Neural Jacobian Fields is to lower the barrier, making robotics affordable, adaptable, and accessible to more people. Vision is a resilient, reliable sensor,” says senior author and MIT Assistant Professor Vincent Sitzmann, who leads the Scene Representation group. “It opens the door to robots that can operate in messy, unstructured environments, from farms to construction sites, without expensive infrastructure.”

“Vision alone can provide the cues needed for localization and control, eliminating the need for GPS, external tracking systems, or complex onboard sensors. This opens the door to robust, adaptive behavior in unstructured environments, from drones navigating indoors or underground without maps to mobile manipulators working in cluttered homes or warehouses, and even legged robots traversing uneven terrain,” says co-author Daniela Rus, MIT professor of electrical engineering and computer science and director of CSAIL. “By learning from visual feedback, these systems develop internal models of their own motion and dynamics, enabling flexible, self-supervised operation where traditional localization methods would fail.”

While training NJF currently requires multiple cameras and must be redone for each robot, the researchers are already imagining a more accessible version. In the future, hobbyists could record a robot’s random movements with their phone, much like you’d take a video of a rental car before driving off, and use that footage to create a control model, with no prior knowledge or special equipment required.

The system doesn’t yet generalize across different robots, and it lacks force or tactile sensing, limiting its effectiveness on contact-rich tasks. But the team is exploring new ways to address these limitations: improving generalization, handling occlusions, and extending the model’s ability to reason over longer spatial and temporal horizons.

“Just as humans develop an intuitive understanding of how their bodies move and respond to commands, NJF gives robots that kind of embodied self-awareness through vision alone,” says Li. “This understanding is a foundation for flexible manipulation and control in real-world environments. Our work, essentially, reflects a broader trend in robotics: moving away from manually programming detailed models toward teaching robots through observation and interaction.”

This paper brought together the computer vision and self-supervised learning work of the Sitzmann lab and the expertise in soft robots of the Rus lab. Li, Sitzmann, and Rus co-authored the paper with CSAIL affiliates Annan Zhang SM ’22, a PhD student in electrical engineering and computer science (EECS); Boyuan Chen, a PhD student in EECS; Hanna Matusik, an undergraduate researcher in mechanical engineering; and Chao Liu, a postdoc in the Senseable City Lab at MIT.

The research was supported by the Solomon Buchsbaum Research Fund through MIT’s Research Support Committee, an MIT Presidential Fellowship, the National Science Foundation, and the Gwangju Institute of Science and Technology.
