Creating realistic 3D models for applications like virtual reality, filmmaking, and engineering design can be a cumbersome process requiring a great deal of manual trial and error.
While generative artificial intelligence models for images can streamline artistic processes by enabling creators to produce lifelike 2D images from text prompts, these models are not designed to generate 3D shapes. To bridge the gap, a recently developed technique called Score Distillation leverages 2D image generation models to create 3D shapes, but its output often ends up blurry or cartoonish.
MIT researchers explored the relationships and differences between the algorithms used to generate 2D images and 3D shapes, identifying the root cause of lower-quality 3D models. From there, they crafted a simple fix to Score Distillation, which enables the generation of sharp, high-quality 3D shapes that are closer in quality to the best model-generated 2D images.
Other methods try to fix this problem by retraining or fine-tuning the generative AI model, which can be expensive and time-consuming.
In contrast, the MIT researchers’ technique achieves 3D shape quality on par with or better than these approaches without additional training or complex postprocessing.
Moreover, by identifying the cause of the problem, the researchers have improved mathematical understanding of Score Distillation and related techniques, enabling future work to further improve performance.
“Now we know where we should be heading, which allows us to find more efficient solutions that are faster and higher-quality,” says Artem Lukoianov, an electrical engineering and computer science (EECS) graduate student who is lead author of a paper on this technique. “In the long run, our work can help facilitate the process to be a co-pilot for designers, making it easier to create more realistic 3D shapes.”
Lukoianov’s co-authors are Haitz Sáez de Ocáriz Borde, a graduate student at Oxford University; Kristjan Greenewald, a research scientist in the MIT-IBM Watson AI Lab; Vitor Campagnolo Guizilini, a scientist at the Toyota Research Institute; Timur Bagautdinov, a research scientist at Meta; and senior authors Vincent Sitzmann, an assistant professor of EECS at MIT who leads the Scene Representation Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL), and Justin Solomon, an associate professor of EECS and leader of the CSAIL Geometric Data Processing Group. The research will be presented at the Conference on Neural Information Processing Systems.
From 2D images to 3D shapes
Diffusion models, such as DALL-E, are a type of generative AI model that can produce lifelike images from random noise. To train these models, researchers add noise to images and then teach the model to reverse the process and remove the noise. The models use this learned “denoising” process to create images based on a user’s text prompts.
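For readers who want a concrete picture of that training objective, here is a minimal sketch in PyTorch. The model signature and the noise schedule `alpha_bar` are illustrative assumptions, not the setup of any particular diffusion model:

```python
import torch
import torch.nn.functional as F

def denoising_loss(model, images, alpha_bar):
    """One training step of the denoising objective: corrupt clean images
    with noise, then score how well the model predicts that noise.
    `model` and the cumulative noise schedule `alpha_bar` are assumed inputs."""
    t = torch.randint(0, alpha_bar.shape[0], (images.shape[0],))  # random timestep per image
    eps = torch.randn_like(images)                                # the noise we add
    a = alpha_bar[t].view(-1, 1, 1, 1)                            # schedule value at t
    noisy = a.sqrt() * images + (1 - a).sqrt() * eps              # forward (noising) process
    return F.mse_loss(model(noisy, t), eps)                       # learn to predict the noise
```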
But diffusion models underperform at directly generating realistic 3D shapes because there are not enough 3D data to train them. To get around this problem, researchers developed a technique called Score Distillation Sampling (SDS) in 2022 that uses a pretrained diffusion model to combine 2D images into a 3D representation.
The technique involves starting with a random 3D representation, rendering a 2D view of a desired object from a random camera angle, adding noise to that image, denoising it with a diffusion model, then optimizing the random 3D representation so it matches the denoised image. These steps are repeated until the desired 3D object is generated, as sketched below.
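A simplified, self-contained sketch of that loop follows. The tiny “renderer” and “diffusion model” here are placeholder functions for illustration only; a real pipeline would use a neural 3D representation and a pretrained text-to-image model:

```python
import torch

def render(theta, camera):
    # Placeholder differentiable renderer: weight the 3D parameters by a random
    # camera weighting and sum into a 2D image (an assumption, not a real renderer).
    return (theta * camera).sum(dim=0, keepdim=True)

def diffusion_denoise(noisy, t, prompt_embedding):
    # Placeholder for a pretrained diffusion model's denoised estimate of an image.
    return 0.9 * noisy + 0.1 * prompt_embedding

theta = torch.randn(8, 1, 64, 64, requires_grad=True)  # random 3D representation
prompt_embedding = torch.zeros(1, 1, 64, 64)           # stands in for the text prompt
opt = torch.optim.Adam([theta], lr=1e-2)

for step in range(1000):
    camera = torch.rand(8, 1, 1, 1)                    # random camera angle
    image = render(theta, camera)                      # 2D view of the current 3D state
    t = torch.randint(20, 980, (1,))                   # random diffusion timestep
    noise = torch.randn_like(image)                    # noise added to the rendering
    noisy = image + 0.5 * noise
    denoised = diffusion_denoise(noisy, t, prompt_embedding).detach()
    loss = ((image - denoised) ** 2).mean()            # pull the 3D shape toward the denoised view
    opt.zero_grad()
    loss.backward()
    opt.step()
```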
However, 3D shapes produced this way tend to look blurry or oversaturated.
“This has been a bottleneck for a while. We know the underlying model is capable of doing better, but people didn’t know why this is happening with 3D shapes,” Lukoianov says.
The MIT researchers explored the steps of SDS and identified a mismatch between a formula that forms a key part of the process and its counterpart in 2D diffusion models. The formula tells the model how to update the random representation by adding and removing noise, one step at a time, to make it look more like the desired image.
Since part of this formula involves an equation that is too complex to be solved efficiently, SDS replaces it with randomly sampled noise at each step. The MIT researchers found that this noise leads to blurry or cartoonish 3D shapes.
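For readers who want the underlying math: in the standard 2022 formulation of SDS (which the article describes but does not write out, so the notation below is ours), the update to the 3D parameters takes roughly this form, and the trailing $\epsilon$ is the randomly sampled noise the researchers identified as the culprit:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  \approx \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,
    \bigl(\hat{\epsilon}_\phi(x_t;\, y, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta} \,\right]
```

Here $x$ is the rendered view, $x_t$ its noised version at timestep $t$, $y$ the text prompt, $w(t)$ a weighting term, and $\hat{\epsilon}_\phi$ the pretrained diffusion model’s noise prediction.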
An approximate answer
Instead of trying to solve this cumbersome formula exactly, the researchers tested approximation techniques until they identified the best one. Rather than randomly sampling the noise term, their approximation technique infers the missing term from the current 3D shape rendering.
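In code terms, the change is small. Below is a hedged sketch of the contrast; the closed-form inversion shown is an illustrative stand-in for the paper’s actual inference procedure, and `alpha_bar_t` is assumed to be a tensor-valued schedule entry:

```python
import torch

def random_sds_noise(image):
    # Standard SDS: draw fresh random noise at each step (the source of the blur).
    return torch.randn_like(image)

def inferred_noise(rendering, noisy, alpha_bar_t):
    # Sketch of the fix: treat the current 3D rendering as the clean image and
    # solve the forward-noising relation
    #   noisy = sqrt(alpha_bar_t) * clean + sqrt(1 - alpha_bar_t) * noise
    # for the noise term, instead of sampling it at random.
    return (noisy - alpha_bar_t.sqrt() * rendering) / (1 - alpha_bar_t).sqrt()
```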
“By doing this, as the analysis in the paper predicts, it generates 3D shapes that look sharp and realistic,” he says.
In addition, the researchers increased the resolution of the image rendering and adjusted some model parameters to further boost 3D shape quality.
In the end, they were able to use an off-the-shelf, pretrained image diffusion model to create smooth, realistic-looking 3D shapes without the need for costly retraining. The 3D objects are similarly sharp to those produced by other methods that rely on ad hoc solutions.
“Trying to blindly experiment with different parameters, sometimes it works and sometimes it doesn’t, but you don’t know why. We know this is the equation we need to solve. Now, this allows us to think of more efficient ways to solve it,” he says.
Because their method relies on a pretrained diffusion model, it inherits the biases and shortcomings of that model, making it vulnerable to hallucinations and other failures. Improving the underlying diffusion model would enhance their process.
In addition to studying the formula to see how they could solve it more effectively, the researchers are interested in exploring how these insights could improve image editing techniques.
This work is funded, in part, by the Toyota Research Institute, the U.S. National Science Foundation, the Singapore Defense Science and Technology Agency, the U.S. Intelligence Advanced Research Projects Activity, the Amazon Science Hub, IBM, the U.S. Army Research Office, the CSAIL Future of Data program, the Wistron Corporation, and the MIT-IBM Watson AI Laboratory.