Paint3D: Image Generation using Lighting-Less Diffusion Models


The arrival of deep generative AI models has significantly accelerated the development of AI, with remarkable capabilities in natural language generation, 3D generation, image generation, and speech synthesis. 3D generative models have transformed numerous industries and applications, revolutionizing the current 3D production landscape. However, many current deep generative models encounter a common roadblock: complex wiring and generated meshes with baked-in lighting textures are often incompatible with traditional rendering pipelines such as Physically Based Rendering (PBR). Diffusion-based models that generate 3D assets without lighting textures offer remarkable capabilities for diverse 3D asset generation, augmenting existing 3D frameworks across industries such as filmmaking, gaming, and augmented/virtual reality.

In this article, we will discuss Paint3D, a novel coarse-to-fine framework capable of producing diverse, high-resolution 2K UV texture maps for untextured 3D meshes, conditioned on either visual or textual inputs. The key challenge that Paint3D addresses is generating high-quality textures without embedding illumination information, allowing users to re-edit or re-light within modern graphics pipelines. To tackle this issue, the Paint3D framework employs a pre-trained 2D diffusion model to perform multi-view texture fusion and generate view-conditional images, initially producing a coarse texture map. However, since 2D models cannot fully disable lighting effects or completely represent 3D shapes, the texture map may exhibit illumination artifacts and incomplete areas.

In this article, we will explore the Paint3D framework in depth, examining how it works, its architecture, and how it compares against state-of-the-art deep generative frameworks. So let's get started.

Deep generative AI models have demonstrated exceptional capabilities in natural language generation, 3D generation, and image synthesis, and have been deployed in real-life applications, revolutionizing the 3D generation industry. However, despite their remarkable capabilities, modern deep generative AI frameworks often produce meshes with complex wiring and chaotic lighting textures that are incompatible with conventional rendering pipelines, including Physically Based Rendering (PBR). Texture synthesis, meanwhile, has advanced rapidly, especially through the use of 2D diffusion models. These models effectively utilize pre-trained depth-to-image diffusion models and text conditions to generate high-quality textures. However, a significant challenge remains: pre-illuminated textures can adversely affect the final 3D environment renderings, introducing lighting errors when the lights are adjusted within common workflows, as demonstrated in the following image.

As observed, texture maps without pre-illumination work seamlessly with traditional rendering pipelines, delivering accurate results. In contrast, texture maps with pre-illumination introduce inappropriate shadows when relighting is applied. Texture generation frameworks trained on 3D data offer an alternative approach, generating textures by understanding a specific 3D object's entire geometry. While these frameworks may deliver better results, they lack the generalization capabilities needed to apply the model to 3D objects outside their training data.

Current texture generation models face two critical challenges: achieving broad generalization across different objects using image guidance or diverse prompts, and eliminating coupled illumination from pre-training results. Pre-illuminated textures can interfere with the final outcomes of textured objects within rendering engines. Moreover, since pre-trained 2D diffusion models only provide 2D results in the view domain, they lack a comprehensive understanding of shapes, making it difficult to maintain view consistency for 3D objects.

To address these challenges, the Paint3D framework develops a dual-stage texture diffusion model for 3D objects that generalizes across different pre-trained generative models and preserves view consistency while generating lighting-free textures.

Paint3D is a dual-stage, coarse-to-fine texture generation model that leverages the strong prompt guidance and image generation capabilities of pre-trained generative AI models to texture 3D objects. In the first stage, Paint3D progressively samples multi-view images from a pre-trained depth-aware 2D image diffusion model, enabling the generalization of high-quality, rich texture results from diverse prompts. The model then generates an initial texture map by back-projecting these images onto the 3D mesh surface. In the second stage, the model focuses on generating lighting-free textures by employing diffusion models specialized in removing lighting influences and refining shape-aware incomplete regions. Throughout the process, the Paint3D framework consistently generates semantically consistent, high-quality 2K textures while eliminating intrinsic illumination effects.

In summary, Paint3D is a novel, coarse-to-fine generative AI model designed to produce diverse, lighting-free, high-resolution 2K UV texture maps for untextured 3D meshes. It aims to achieve state-of-the-art performance in texturing 3D objects with different conditional inputs, including text and images, offering significant benefits for synthesis and graphics editing tasks.

Methodology and Architecture

The Paint3D framework generates and refines texture maps progressively to produce diverse, high-quality textures for 3D models using conditional inputs such as images and prompts, as demonstrated in the following image.

Stage 1: Progressive Coarse Texture Generation

In the initial coarse texture generation stage, Paint3D employs pre-trained 2D image diffusion models to sample multi-view images, which are then back-projected onto the mesh surface to create the initial texture maps. This stage begins with generating a depth map from various camera views. The model uses these depth conditions to sample images from the diffusion model, which are then back-projected onto the 3D mesh surface. This alternating rendering, sampling, and back-projection approach enhances the consistency of the texture meshes and helps progressively build up the texture map.
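
To make the depth-conditioned sampling step concrete, here is a minimal sketch using an off-the-shelf depth ControlNet from the diffusers library. Paint3D trains its own conditional encoders, so the checkpoints, prompt, and depth map file below are illustrative assumptions rather than the framework's actual components.

```python
# Minimal sketch of depth-conditioned view sampling with public components
# from the diffusers library. Paint3D trains its own depth encoder; the
# checkpoints and file names below are illustrative substitutes.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Depth ControlNet plugged into a Stable Diffusion text-to-image backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Depth map rendered from the current camera view (hypothetical file).
depth_map = load_image("view_000_depth.png")

# Sample a view-conditional RGB image guided by the prompt and the depth map.
view_image = pipe(
    prompt="a weathered leather backpack, photorealistic texture",
    image=depth_map,
    num_inference_steps=30,
).images[0]
view_image.save("view_000_rgb.png")
```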

The process starts with the visible regions of the 3D mesh, focusing on generating texture from the first camera view by rendering the 3D mesh to a depth map. A texture image is then sampled based on appearance and depth conditions and back-projected onto the mesh. This step is repeated for subsequent viewpoints, incorporating previous textures to render not only a depth image but also a partially colored RGB image with an uncolored mask. The model uses a depth-aware image inpainting encoder to fill the uncolored areas, generating a complete coarse texture map by back-projecting the inpainted images onto the 3D mesh, as sketched below.
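
The progressive loop can be summarized roughly as follows. Every helper in this sketch (init_empty_texture, render_view, sample_depth_to_image, sample_depth_aware_inpainting, backproject) is a hypothetical placeholder for the rendering, sampling, and back-projection steps described above, not part of Paint3D's actual API.

```python
# Pseudocode-style sketch of the progressive coarse texture generation loop.
# All helper functions are hypothetical placeholders, not Paint3D's API.
def generate_coarse_texture(mesh, camera_views, prompt):
    texture_map = init_empty_texture(resolution=1024)  # coarse UV texture

    for i, view in enumerate(camera_views):
        # Render the mesh with the current (partial) texture from this view:
        # a depth map plus a partially colored RGB image and its uncolored mask.
        depth, partial_rgb, uncolored_mask = render_view(mesh, texture_map, view)

        if i == 0:
            # First view: plain depth-to-image sampling.
            rgb = sample_depth_to_image(prompt, depth)
        else:
            # Later views: depth-aware inpainting colors only the uncolored
            # regions, keeping already-textured areas consistent across views.
            rgb = sample_depth_aware_inpainting(
                prompt, depth, partial_rgb, uncolored_mask
            )

        # Back-project the sampled view onto the mesh surface / UV texture map.
        texture_map = backproject(mesh, view, rgb, texture_map)

    return texture_map
```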

For more complex scenes or objects, the model uses multiple views simultaneously. Initially, it captures two depth maps from symmetric viewpoints and combines them into a depth grid, which replaces the single depth image for multi-view depth-aware texture sampling.
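
A minimal sketch of how two depth maps from symmetric viewpoints could be tiled into a single depth grid before being handed to the depth-aware sampler; the side-by-side layout is an assumption for illustration.

```python
import torch

def make_depth_grid(depth_front: torch.Tensor, depth_back: torch.Tensor) -> torch.Tensor:
    """Tile two single-view depth maps of shape (H, W) into one side-by-side
    grid of shape (H, 2W), used in place of a single depth image for
    multi-view depth-aware texture sampling."""
    assert depth_front.shape == depth_back.shape
    return torch.cat([depth_front, depth_back], dim=-1)

# Usage with dummy depth maps rendered from two symmetric viewpoints.
depth_grid = make_depth_grid(torch.rand(512, 512), torch.rand(512, 512))
print(depth_grid.shape)  # torch.Size([512, 1024])
```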

Stage 2: Texture Refinement in UV Space

Despite producing reasonable coarse texture maps, challenges such as texture holes caused by the rendering process and lighting shadows introduced by the 2D image diffusion model remain. To address these, Paint3D performs a diffusion process in UV space based on the coarse texture map, enhancing its visual appeal and resolving these issues.

However, refining the texture map in UV space can introduce discontinuities due to the fragmentation of continuous textures into individual fragments. To mitigate this, Paint3D refines the texture map by using the adjacency information of texture fragments. In UV space, a position map represents the 3D adjacency information of texture fragments, treating each non-background element as a 3D point coordinate. The model uses an additional position map encoder, similar to ControlNet, to integrate this adjacency information during the diffusion process.
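
The sketch below illustrates one way such a position map could be built: each non-background texel stores the normalized 3D coordinate of the surface point it maps to, giving the ControlNet-style encoder access to the 3D adjacency of texture fragments. The `uv_to_surface` rasterization helper is a hypothetical placeholder, not Paint3D's implementation.

```python
import torch

def build_position_map(vertices: torch.Tensor, uv_to_surface, resolution: int = 1024):
    """Build a UV-space position map in which every non-background texel
    stores the normalized 3D coordinate of the surface point it maps to.

    `uv_to_surface` is a hypothetical rasterization helper returning, for each
    texel, the interpolated 3D surface position (H, W, 3) and a validity mask
    (H, W) marking texels covered by a UV fragment.
    """
    xyz, valid = uv_to_surface(resolution)

    # Normalize coordinates to [0, 1] with the mesh bounding box so the map
    # can be fed to a ControlNet-style encoder like a regular 3-channel image.
    lo = vertices.min(dim=0).values
    hi = vertices.max(dim=0).values
    position_map = (xyz - lo) / (hi - lo + 1e-8)

    # Background texels carry no adjacency information and are zeroed out.
    position_map[~valid] = 0.0
    return position_map
```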

The model uses this position conditional encoder alongside other encoders to perform refinement tasks in UV space, offering two capabilities: UVHD (UV High Definition) and UV inpainting. UVHD enhances visual appeal and aesthetics, using an image enhancement encoder and the position encoder together with the diffusion model. UV inpainting fills texture holes while avoiding the self-occlusion issues that arise from rendering. The refinement stage starts with UV inpainting, followed by UVHD, to produce the final refined texture map.
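
A hedged sketch of that refinement order follows: UV inpainting first, then UVHD. The two pipeline objects and their conditioning arguments are assumptions standing in for diffusion pipelines equipped with the position-map and image-enhancement encoders; they are not Paint3D's actual interfaces.

```python
# Sketch of the UV-space refinement order. `uv_inpaint_pipe` and `uvhd_pipe`
# are hypothetical stand-ins for diffusion pipelines equipped with the
# position-map encoder (and, for UVHD, the image enhancement encoder).
def refine_texture_map(coarse_texture, hole_mask, position_map, prompt):
    # Step 1: UV inpainting fills texture holes left by self-occlusion,
    # conditioned on the position map so fragment adjacency is respected.
    inpainted = uv_inpaint_pipe(
        prompt=prompt,
        image=coarse_texture,
        mask_image=hole_mask,
        control_image=position_map,
    ).images[0]

    # Step 2: UVHD sharpens detail and suppresses residual lighting/shadow
    # artifacts, again guided by the position map.
    refined = uvhd_pipe(
        prompt=prompt,
        image=inpainted,
        control_image=position_map,
    ).images[0]
    return refined
```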

By integrating these refinement methods, the Paint3D framework generates complete, diverse, high-resolution, and lighting-free UV texture maps, making it a robust solution for texturing 3D objects.

Paint3D: Experiments and Results

The Paint3D model utilizes the Stable Diffusion text2image model for texture generation tasks, while an image encoder component handles image conditions. To strengthen its control over conditional tasks such as image inpainting, depth handling, and high-definition imagery, the Paint3D framework employs ControlNet domain encoders. The model is implemented in PyTorch, with rendering and texture projections executed with Kaolin.
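
As a rough sense of what such a stack looks like in practice, the sketch below assembles an inpainting-capable diffusion pipeline with a ControlNet encoder via diffusers and loads a mesh with Kaolin. The checkpoint IDs and mesh path are placeholders standing in for Paint3D's own trained domain encoders and assets.

```python
# Illustrative setup of the conditioning stack described above, using public
# checkpoints as substitutes for Paint3D's trained domain encoders.
import torch
import kaolin
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

dtype = torch.float16

# Inpainting branch used for uncolored regions and UV-space hole filling.
inpaint_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=dtype
)
inpaint_pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=inpaint_controlnet,
    torch_dtype=dtype,
).to("cuda")

# Mesh loading for rendering and texture projection with Kaolin
# (the asset path is hypothetical).
mesh = kaolin.io.obj.import_mesh("assets/backpack.obj")
print(mesh.vertices.shape, mesh.faces.shape)
```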

Text to Texture Comparison

To evaluate Paint3D's performance, we begin by analyzing its texture generation when conditioned on textual prompts, comparing it against state-of-the-art frameworks such as Text2Tex, TEXTure, and LatentPaint. As shown in the following image, the Paint3D framework not only excels at generating high-quality texture details but also effectively synthesizes an illumination-free texture map.

By leveraging the robust capabilities of Stable Diffusion and ControlNet encoders, Paint3D provides superior texture quality and flexibility. The comparison highlights Paint3D's ability to produce detailed, high-resolution textures without embedded illumination, making it a leading solution for 3D texturing tasks.

In comparison, the LatentPaint framework is prone to generating blurry textures, resulting in suboptimal visual effects. The TEXTure framework, on the other hand, generates clear textures but lacks smoothness and exhibits noticeable splicing and seams. Finally, the Text2Tex framework generates smooth textures remarkably well, but it fails to replicate this performance when generating fine textures with intricate detail. The following image compares the Paint3D framework with state-of-the-art frameworks quantitatively.

As can be observed, the Paint3D framework outperforms all existing models by a significant margin, with nearly a 30% improvement in the FID score and roughly a 40% improvement in the KID score. These improvements in FID and KID demonstrate Paint3D's ability to generate high-quality textures across diverse objects and categories.
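
For readers unfamiliar with these metrics, the sketch below shows how FID and KID scores of this kind can be computed with torchmetrics. This is not the authors' evaluation code, and the image tensors here are random placeholders standing in for rendered and reference views.

```python
# Hedged sketch of computing FID / KID with torchmetrics; the image tensors
# are random placeholders, not actual rendered or reference views.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# With default settings, images must be uint8 tensors of shape (N, 3, H, W).
real_renders = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_renders = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=32)

fid.update(real_renders, real=True)
fid.update(fake_renders, real=False)
kid.update(real_renders, real=True)
kid.update(fake_renders, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```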

Image to Texture Comparison

To evaluate Paint3D's generative capabilities using visual prompts, we use the TEXTure model as the baseline. As mentioned earlier, the Paint3D model employs an image encoder sourced from Stable Diffusion's text2image model. As can be seen in the following image, the Paint3D framework synthesizes exquisite textures remarkably well and is still able to maintain high fidelity with respect to the image condition.

On the other hand, the TEXTure framework is capable of generating a texture similar to Paint3D's, but it falls short of accurately representing the texture details in the image condition. Moreover, as demonstrated in the following image, the Paint3D framework delivers better FID and KID scores than the TEXTure framework, with FID decreasing from 40.83 to 26.86 and KID dropping from 9.76 to 4.94.

Final Thoughts

In this article, we have discussed Paint3D, a novel coarse-to-fine framework capable of producing lighting-less, diverse, and high-resolution 2K UV texture maps for untextured 3D meshes, conditioned on either visual or textual inputs. The main highlight of the Paint3D framework is its ability to generate lighting-less, semantically consistent, high-resolution 2K UV textures conditioned on image or text inputs. Owing to its coarse-to-fine approach, the Paint3D framework produces lighting-less, diverse, and high-resolution texture maps and delivers better performance than current state-of-the-art frameworks.
