
Uni3D: Exploring Unified 3D Representation at Scale


Scaling up representations of text and visuals has been a major focus of research in recent years, and the resulting advances have driven several breakthroughs in language and vision learning. However, despite the success of scaling text and visual representations, the scaling of representations for 3D scenes and objects remains insufficiently explored.

Today, we'll discuss Uni3D, a 3D foundation model that aims to explore unified 3D representations. The Uni3D framework employs a 2D-initialized ViT, pretrained end-to-end, to align image-text features with their corresponding 3D point cloud features.

The Uni3D framework uses pretext tasks and a simple architecture to leverage the abundance of pretrained 2D models and image-text-aligned models as initializations and targets, respectively. This approach unleashes the full potential of 2D models and of the strategies used to scale them, carrying both over to the 3D world.
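To make the idea concrete, here is a minimal PyTorch-style sketch of how a 2D-ViT-style Transformer can act as a 3D encoder: the point cloud is grouped into local patches, each patch is embedded into a token, and the token sequence is processed by a standard Transformer whose weights can be initialized from a pretrained 2D ViT. All module names, layer sizes, and the grouping step are illustrative assumptions, not the official Uni3D implementation.

```python
# Minimal sketch (assumption, not the official Uni3D code): a point cloud is split
# into local patches, each patch is embedded into a token, and the token sequence
# is processed by a standard ViT-style Transformer whose weights can be
# initialized from a pretrained 2D ViT.
import torch
import torch.nn as nn


class PointPatchEmbed(nn.Module):
    """Embed each local point patch (a group of xyz points) into one token."""

    def __init__(self, points_per_patch=32, embed_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(points_per_patch * 3, 256),
            nn.GELU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, patches):
        # patches: (batch, num_patches, points_per_patch, 3)
        B, G, P, _ = patches.shape
        return self.mlp(patches.reshape(B, G, P * 3))  # (batch, num_patches, embed_dim)


class PointCloudViT(nn.Module):
    """ViT-style Transformer reused as a 3D encoder over point-cloud tokens."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PointPatchEmbed(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            embed_dim, num_heads, dim_feedforward=4 * embed_dim, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):
        tokens = self.patch_embed(patches)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        x = self.blocks(torch.cat([cls, tokens], dim=1))
        return x[:, 0]  # global 3D feature taken from the [CLS] token
```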

In this article, we'll delve deeper into 3D computer vision and the Uni3D framework, exploring the essential concepts and the architecture of the model. So, let's begin.

Over the past few years, computer vision has emerged as one of the most heavily invested domains in the AI industry. Following significant advancements in 2D computer vision frameworks, developers have shifted their focus to 3D computer vision. This field, particularly 3D representation learning, merges facets of computer graphics, machine learning, computer vision, and mathematics to automate the processing and understanding of 3D geometry. The rapid development of 3D sensors like LiDAR, along with their widespread use in the AR/VR industry, has brought increased attention to 3D representation learning, and its potential applications continue to grow.

Although existing frameworks have shown remarkable progress in 3D model architecture, task-oriented modeling, and learning objectives, most explore 3D architectures at a relatively small scale, with limited data, parameters, and task scenarios. The challenge of learning scalable 3D representations that can then be applied to real-time applications in diverse environments remains largely unexplored.

Over the past few years, scaling up pre-trained large language models has helped revolutionize the natural language processing domain, and recent works have shown that this progress can be translated from language to 2D vision through data and model scaling. This paves the way for developers to attempt to repeat that success and learn a 3D representation that can be scaled and transferred to real-world applications.

Uni3D is a scalable, unified 3D pretraining framework developed to learn large-scale 3D representations, testing its limits at the scale of over a billion parameters, over 10 million images paired with over 70 million texts, and over one million 3D shapes. The figure below compares zero-shot accuracy against parameter count for the Uni3D framework, which successfully scales 3D representations from 6 million to over a billion parameters.

The Uni3D framework uses a 2D ViT, or Vision Transformer, as the 3D encoder, which is then pre-trained end-to-end to align image-text-aligned features with the 3D point cloud features. The Uni3D framework makes use of pretext tasks and a simple architecture to leverage the abundance of pretrained 2D models and image-text-aligned models as initializations and targets, respectively, thus unleashing the full potential of 2D models and of the techniques for scaling them to the 3D world; a minimal sketch of this alignment objective is shown after the list below. The flexibility and scalability of the Uni3D framework are measured in terms of:

  1. Scaling the model from 6M to over a billion parameters. 
  2. Varying the 2D initialization, from visual self-supervised learning to text-supervised models. 
  3. Scaling the text-image target model from 150 million to over a billion parameters. 
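The end-to-end alignment described above can be sketched as a CLIP-style contrastive objective: point cloud features produced by the 3D encoder are pulled toward the image and text embeddings of the same objects, produced by a frozen image-text target model. The loss below is a hedged illustration of this idea; the exact formulation, temperature, and batching used in Uni3D may differ.

```python
# Hedged sketch of the end-to-end alignment objective (illustrative only): the 3D
# encoder is trained so that its point-cloud features match the image and text
# embeddings of the same objects, produced by a frozen image-text (CLIP-like) model.
import torch
import torch.nn.functional as F


def alignment_loss(point_feat, image_feat, text_feat, temperature=0.07):
    """Symmetric InfoNCE-style loss between point-cloud features and the
    frozen image/text features of the same objects (batch size B)."""
    p = F.normalize(point_feat, dim=-1)
    i = F.normalize(image_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    labels = torch.arange(p.shape[0], device=p.device)

    logits_pi = p @ i.T / temperature  # point-to-image similarities
    logits_pt = p @ t.T / temperature  # point-to-text similarities
    loss_pi = (F.cross_entropy(logits_pi, labels) +
               F.cross_entropy(logits_pi.T, labels)) / 2
    loss_pt = (F.cross_entropy(logits_pt, labels) +
               F.cross_entropy(logits_pt.T, labels)) / 2
    return (loss_pi + loss_pt) / 2
```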

Under the flexible and unified framework offered by Uni3D, developers observe a consistent performance boost when scaling up each component. Large-scale 3D representation learning also benefits immensely from the shareable 2D models and scale-up strategies. 

As can be seen in the figure below, the Uni3D framework shows a performance boost over prior art in both few-shot and zero-shot settings. It's worth noting that the Uni3D framework achieves a zero-shot classification accuracy of over 88% on ModelNet, which is on par with the performance of several state-of-the-art supervised methods. 

Moreover, the Uni3D framework also delivers strong accuracy and performance on other representative 3D tasks such as part segmentation and open-world understanding. The Uni3D framework aims to bridge the gap between 2D and 3D vision by scaling 3D foundation models with a unified yet simple pre-training approach that learns more robust 3D representations across a wide range of tasks, ultimately helping drive the convergence of 2D and 3D vision across a wide range of modalities.

Uni3D: Related Work

The Uni3D framework draws inspiration from, and builds on, the developments made in prior 3D representation learning and in foundation models, especially those in other modalities. 

3D Representation Learning

3D representation learning methods use point clouds for 3D understanding of objects. This field has been explored extensively in the recent past, and it has been shown that point clouds can be pre-trained under self-supervision using specific 3D pretext tasks, including masked point modeling, self-reconstruction, and contrastive learning. 

It's worth noting that these methods work with limited data, and they often don't investigate multimodal representations from 2D or NLP to 3D. However, the recent success of the CLIP framework, which is highly effective at learning visual concepts from raw text via contrastive learning, has motivated works that learn 3D representations by aligning image, text, and point cloud features using the same contrastive learning method. 

Foundation Models

Developers have been working extensively on designing foundation models that scale up and unify multimodal representations. In the NLP domain, for instance, frameworks that scale up pre-trained language models are steadily revolutionizing the industry. Similar advancements can be observed in the 2D vision domain, where data and model scaling techniques carry the progress made in language over to 2D models. Such frameworks are difficult to replicate for 3D models, however, because of the limited availability of 3D data and the challenges encountered when unifying and scaling up 3D frameworks. 

By learning from these two lines of work, developers have created the Uni3D framework, the first 3D foundation model with over a billion parameters. It makes use of a unified ViT, or Vision Transformer, architecture that allows the Uni3D model to be scaled with the unified 2D and NLP strategies for scaling up models. Developers hope that this approach will allow the Uni3D framework to bridge the gap that currently separates 2D and 3D vision, as well as facilitate multimodal convergence.

Uni3D: Method and Architecture

The above image presents a general overview of the Uni3D framework, a scalable and unified pre-training 3D framework for large-scale 3D representation learning. Developers make use of over 70 million texts and 10 million images paired with over one million 3D shapes to scale the Uni3D framework to over a billion parameters. The Uni3D framework uses a 2D ViT, or Vision Transformer, as the 3D encoder, which is then trained end-to-end to align the text-image features with the 3D point cloud features, allowing the Uni3D framework to deliver the desired efficiency and accuracy across a wide range of benchmarks. Let us now take a detailed look at how the Uni3D framework works. 

Scaling the Uni3D Framework

Prior studies on point cloud representation learning have traditionally focused on designing particular model architectures that deliver higher performance across a wide range of applications, and they work with limited amounts of data because of small-scale datasets. Recent studies have tried exploring the possibility of scalable pre-training in 3D, but they have produced no major outcomes because of the limited availability of 3D data. To resolve the scalability problem of 3D frameworks, the Uni3D framework leverages a vanilla transformer structure that nearly mirrors a Vision Transformer, and it solves the scaling problem by using the unified 2D and NLP scaling-up strategies to grow the model size. 
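Because the 3D encoder mirrors a vanilla ViT, scaling it up can follow the same width-and-depth recipes used for 2D Vision Transformers. The configurations below are assumed, standard ViT-style variants included purely to illustrate how a roughly 6M-to-1B parameter sweep might be expressed; they are not the official Uni3D configurations.

```python
# Illustrative scaling recipe (assumed sizes mirroring standard ViT variants,
# not the official Uni3D configurations): since the 3D encoder is a vanilla
# ViT-style Transformer, it can be scaled with the same width/depth settings
# used for 2D ViTs, from a few million to roughly a billion parameters.
VIT_SCALING_CONFIGS = {
    # name          embed_dim   depth   num_heads    approx. params
    "tiny":  dict(embed_dim=192,  depth=12, num_heads=3),    # ~6M
    "base":  dict(embed_dim=768,  depth=12, num_heads=12),   # ~86M
    "large": dict(embed_dim=1024, depth=24, num_heads=16),   # ~307M
    "giant": dict(embed_dim=1408, depth=40, num_heads=16),   # ~1B
}


def build_encoder(scale="base"):
    # PointCloudViT is the hypothetical sketch encoder defined earlier in this article.
    cfg = VIT_SCALING_CONFIGS[scale]
    return PointCloudViT(**cfg)
```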
