A robot searching for workers trapped in a partially collapsed mine shaft must rapidly generate a map of the scene and identify its location within that scene as it navigates the treacherous terrain.
Researchers have recently begun building powerful machine-learning models to perform this complex task using only images from the robot’s onboard cameras, but even the most effective models can only process a few images at a time. In a real-world disaster where every second counts, a search-and-rescue robot would need to quickly traverse large areas and process hundreds of images to complete its mission.
To overcome this problem, MIT researchers drew on ideas from both recent artificial-intelligence vision models and classical computer vision to develop a new system that can process an arbitrary number of images. Their system accurately generates 3D maps of complicated scenes, like a crowded office corridor, in a matter of seconds.
The AI-driven system incrementally creates and aligns smaller submaps of the scene, which it stitches together to reconstruct a full 3D map while estimating the robot’s position in real time.
Unlike many other approaches, their technique does not require calibrated cameras or an expert to tune a complex system implementation. The simpler nature of their approach, coupled with the speed and quality of the 3D reconstructions, would make it easier to scale up for real-world applications.
Beyond helping search-and-rescue robots navigate, this method could be used to build extended reality applications for wearable devices like VR headsets, or to enable industrial robots to quickly find and move goods inside a warehouse.
“For robots to perform increasingly complex tasks, they need much more complex map representations of the world around them. But at the same time, we don’t want to make it harder to implement these maps in practice. We’ve shown that it is possible to generate an accurate 3D reconstruction in a matter of seconds with a tool that works out of the box,” says Dominic Maggio, an MIT graduate student and lead author of a paper on this method.
Maggio is joined on the paper by postdoc Hyungtae Lim and senior author Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro), principal investigator in the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. The research will be presented at the Conference on Neural Information Processing Systems.
Mapping out a solution
For years, researchers have been grappling with an essential element of robotic navigation called simultaneous localization and mapping (SLAM). In SLAM, a robot recreates a map of its environment while orienting itself within that space.
Traditional optimization methods for this task tend to fail in difficult scenes, or they require the robot’s onboard cameras to be calibrated beforehand. To avoid these pitfalls, researchers train machine-learning models to learn this task from data.
While they are simpler to implement, even the best models can only process about 60 camera images at a time, making them infeasible for applications where a robot needs to move quickly through a varied environment while processing hundreds of images.
To solve this problem, the MIT researchers designed a system that generates smaller submaps of the scene instead of the entire map. Their method “glues” these submaps together into one overall 3D reconstruction. The model still only processes a few images at a time, but the system can recreate larger scenes much faster by stitching smaller submaps together.
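As a rough illustration of that stitching loop (not the authors’ actual code), the sketch below processes a video as fixed-size, overlapping batches of frames, reconstructs a local submap from each batch, and chains the estimated submap-to-submap transforms into one global point cloud. The batch and overlap sizes and the helpers `reconstruct_submap` and `align_submaps` are hypothetical placeholders.

```python
# Rough sketch of submap stitching (hypothetical helpers, not the paper's code).
import numpy as np

BATCH, OVERLAP = 60, 10  # assumed sizes; consecutive batches share frames so submaps can be aligned

def stitch_scene(frames, reconstruct_submap, align_submaps):
    """Chain per-batch submaps into one global point cloud."""
    world_points = []
    world_from_prev = np.eye(4)   # pose of the previous submap in the world frame
    prev_submap = None
    for start in range(0, max(1, len(frames) - OVERLAP), BATCH - OVERLAP):
        submap = reconstruct_submap(frames[start:start + BATCH])  # (N, 3) local points
        if prev_submap is not None:
            # 4x4 transform taking the current submap into the previous one,
            # composed with the running world pose.
            prev_from_curr = align_submaps(prev_submap, submap)
            world_from_prev = world_from_prev @ prev_from_curr
        pts_h = np.c_[submap, np.ones(len(submap))]   # homogeneous coordinates
        world_points.append((world_from_prev @ pts_h.T).T[:, :3])
        prev_submap = submap
    return np.vstack(world_points)
```

In a scheme like this, any error in a single alignment propagates through the whole chain, which is why getting the alignment step right matters so much.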
“This seemed like a very simple solution, but when I first tried it, I was surprised that it didn’t work that well,” Maggio says.
Searching for an explanation, he dug into computer vision research papers from the 1980s and 1990s. Through this analysis, Maggio realized that errors in the way the machine-learning models process images make aligning submaps a more complex problem.
Traditional methods align submaps by applying rotations and translations until they line up. But these new models can introduce some ambiguity into the submaps, which makes them harder to align. For instance, a 3D submap of one side of a room might have walls that are slightly bent or stretched. Simply rotating and translating these deformed submaps to align them doesn’t work.
“We need to make sure all of the submaps are deformed in a consistent way so we can align them well with one another,” Carlone explains.
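The classical rigid alignment step looks roughly like the following: given corresponding points from two submaps, it solves for the single rotation and translation that best overlays them (the Kabsch/orthogonal Procrustes solution). This is only a minimal sketch for intuition; the step of finding correspondences is omitted.

```python
# Classical rigid alignment of two point sets with known correspondences
# (Kabsch / orthogonal Procrustes): the "rotate and translate until they
# line up" step described above.
import numpy as np

def rigid_align(src, dst):
    """Return R (3x3) and t (3,) minimizing ||R @ src_i + t - dst_i||."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t
```

Because the solver’s only knobs are a rotation and a translation, any bending or stretching in the submaps shows up directly as residual misalignment.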
A more flexible approach
Borrowing ideas from classical computer vision, the researchers developed a more flexible mathematical technique that can represent all the deformations in these submaps. By applying mathematical transformations to each submap, this more flexible method can align them in a way that addresses the ambiguity.
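The paper’s exact transformation family isn’t spelled out here, but the flavor of the idea can be shown with a slightly richer classical alignment: Umeyama’s method, which estimates a scale factor along with the rotation and translation, so a consistent shrinking or stretching of a submap no longer prevents alignment. The sketch below is only a stand-in for the richer deformations the MIT system handles.

```python
# Similarity alignment (Umeyama): scale + rotation + translation between
# corresponding point sets. A stand-in for the paper's more general
# deformation model, illustrating how extra degrees of freedom absorb
# consistent distortions.
import numpy as np

def similarity_align(src, dst):
    """Return s, R (3x3), t (3,) minimizing ||s * R @ src_i + t - dst_i||."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(src_c.T @ dst_c / len(src))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])         # reflection guard
    R = Vt.T @ D @ U.T
    s = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Usage would mirror the rigid case: pick corresponding points in two overlapping submaps, estimate the transform, and apply it to one submap before stitching.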
Based on input images, the system outputs a 3D reconstruction of the scene and estimates of the camera locations, which the robot would use to localize itself within the space.
“Once Dominic had the intuition to bridge these two worlds — learning-based approaches and traditional optimization methods — the implementation was fairly straightforward,” Carlone says. “Coming up with something this effective and simple has potential for a lot of applications.”
Their system performed faster and with less reconstruction error than other methods, without requiring special cameras or additional tools to process data. The researchers generated close-to-real-time 3D reconstructions of complex scenes like the inside of the MIT Chapel using only short videos captured on a cell phone.
The average error in these 3D reconstructions was less than 5 centimeters.
In the future, the researchers intend to make their method more reliable for especially complicated scenes and work toward implementing it on real robots in difficult settings.
“Knowing about traditional geometry pays off. If you understand deeply what is happening in the model, you can get much better results and make things much more scalable,” Carlone says.
This work is supported, in part, by the U.S. National Science Foundation, the U.S. Office of Naval Research, and the National Research Foundation of Korea. Carlone, currently on sabbatical as an Amazon Scholar, completed this work before he joined Amazon.
