SHOW-O: A Single Transformer Uniting Multimodal Understanding and Generation


Significant advancements in large language models (LLMs) have inspired the development of multimodal large language models (MLLMs). Early MLLM efforts, such as LLaVA, MiniGPT-4, and InstructBLIP, show notable multimodal understanding capabilities. To integrate LLMs into multimodal domains, these studies explored projecting features from a pre-trained modality-specific encoder, such as CLIP, into the input space of LLMs, enabling multimodal understanding and reasoning within the transformer backbone. Although there are numerous design decisions for MLLMs, such as vision encoders, feature alignment adapters, and datasets, the training for most of these models adheres to the autoregressive generation paradigm, which has proven effective for text generation in LLMs. Despite their strong multimodal understanding capabilities, these models primarily focus on visual perception and lack the ability to generate multimodal outputs beyond text.

Transformer models have demonstrated great success in autoregressive modeling for natural language processing. Inspired by this progress, previous studies have directly applied the same autoregressive modeling to learn the dependency of image pixels for image and video generation. For example, VideoPoet employs a decoder-only transformer architecture to synthesize high-quality videos from multimodal inputs. More recently, LlamaGen has shown that a large language model architecture like Llama can autoregressively model image tokens, achieving decent performance in class-conditional image generation.

In this article, we discuss Show-O, a unified transformer that integrates multimodal understanding and generation. Unlike fully autoregressive models, Show-O unifies autoregressive and discrete diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks, including visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, Show-O demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters, highlighting its potential as a next-generation foundation model.

In the continuous diffusion framework, the model is tasked with predicting the Gaussian noise added to continuous latent representations. In contrast, other models such as D3PM, Mask-predict, ARDM, and MaskGIT use a discrete corruption process as an alternative to Gaussian diffusion. Specifically, an image is represented as a sequence of discrete tokens using image tokenizers, with each token associated with a categorical label. The token-wise distribution is transformed into a uniform distribution through a stochastic sampling process. During training, a portion of these tokens is randomly masked, and the model is trained to predict the original values of the masked tokens. In this work, Show-O adopts discrete diffusion modeling for visual generation.
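To make the masked-token objective concrete, the sketch below illustrates one training step of mask-predict style discrete diffusion in PyTorch. The masking schedule, the reserved mask-token id, and the `model` interface are assumptions for illustration, not Show-O's actual implementation.

```python
import math
import torch
import torch.nn.functional as F

MASK_TOKEN_ID = 8192            # illustrative id reserved for the [MASK] token
CODEBOOK_SIZE = 8192            # image codebook size described later in the article

def masked_token_loss(model, image_tokens):
    """One training step of mask-predict style discrete diffusion (sketch).

    image_tokens: LongTensor of shape (batch, seq_len) holding codebook indices.
    model: any transformer that maps token ids to per-position logits.
    """
    batch, seq_len = image_tokens.shape

    # Sample a masking ratio per example (a cosine schedule, as in MaskGIT).
    ratio = torch.cos(0.5 * math.pi * torch.rand(batch, 1))
    mask = torch.rand(batch, seq_len) < ratio              # True = corrupt this token

    corrupted = image_tokens.masked_fill(mask, MASK_TOKEN_ID)

    # The transformer predicts a distribution over the codebook at every position.
    logits = model(corrupted)                              # (batch, seq_len, vocab)

    # The loss is computed only on masked positions: recover the original tokens.
    return F.cross_entropy(logits[mask], image_tokens[mask])
```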

Over the past few years, significant advancements have emerged in the two key pillars of multimodal intelligence: understanding and generation. For multimodal understanding, Multimodal Large Language Models (MLLMs) like LLaVA have demonstrated exceptional capabilities in vision-language tasks such as visual question answering (VQA). For visual generation, denoising diffusion probabilistic models (DDPMs) have revolutionized traditional generative paradigms, achieving unprecedented performance in text-to-image/video generation.

Given these achievements in individual fields, it is natural to explore the potential of connecting them. Recent works have tried to assemble expert models from these two different domains into a unified system that can handle both multimodal understanding and generation. However, existing attempts often involve separate models for understanding and generation. For example, NExT-GPT employs a base language model for multimodal understanding but requires an additional pre-trained diffusion model for image generation. This raises the question: can a single transformer handle both multimodal understanding and generation?

Recently, Chameleon demonstrated that this is possible. Specifically, Chameleon enables the fusion of different modalities to generate both text and image tokens through autoregressive modeling. While it makes sense to model text tokens autoregressively, it is less clear whether modeling image patches or pixels in the same way is optimal. A key bottleneck of autoregressively predicting an image is the large number of sampling steps required, especially when dealing with higher-resolution images. Continuous diffusion models have shown superior performance in visual generation compared to autoregressive ones.

This leads us to explore whether a single transformer can integrate both autoregressive and diffusion modeling. Show-O envisions a new paradigm where text is represented as discrete tokens and modeled autoregressively, while continuous image pixels are modeled using denoising diffusion. However, integrating these two distinct techniques into a single network is non-trivial due to the differences between discrete text tokens and continuous image representations. Moreover, diffusion models typically rely on two distinct components: a text encoder and a denoising network.

To address this, Show-O introduces a novel unified model capable of handling both multimodal understanding and generation tasks using mixed autoregressive and diffusion modeling. Show-O is built upon a pre-trained LLM and leverages its autoregressive modeling capabilities for text-based reasoning. Inspired by other works, Show-O employs discrete denoising diffusion to model image tokens instead of continuous representations. Furthermore, Show-O inherently encodes text conditional information, eliminating the need for an additional text encoder. By using text and image tokenizers, Show-O can process diverse input data and tasks, providing answers autoregressively for vision-language tasks and generating images using discrete denoising diffusion.

Show-O demonstrates comparable, and in some cases better, performance than individual models with an equivalent or larger number of parameters across various benchmarks. Unlike autoregressive image generation, the Show-O framework requires about 20 times fewer sampling steps, making it inherently faster. Moreover, the Show-O framework supports downstream applications like text-guided inpainting and extrapolation without requiring fine-tuning, as demonstrated in the following figure.

Show-O also has the potential for mixed-modality generation, such as interleaved video keyframe generation with text descriptions, showing promise for long-form video generation. Moreover, the Show-O framework investigates the impact of discrete and continuous image representations on multimodal understanding, offering insights for future unified model designs.

The following figure presents a comparison of model characteristics between the Show-O framework and existing methods across various domains. Show-O stands out as a unified model that integrates advanced techniques for both multimodal understanding and generation.

In summary, the main contributions of this paper are as follows:

  • Show-O is a unified model that integrates multimodal understanding and generation using a single transformer.
  • Show-O unifies autoregressive and discrete diffusion modeling within one transformer, handling both text and images effectively.
  • The Show-O framework outperforms or matches individual baseline models with an equivalent or larger number of parameters across multimodal understanding and generation benchmarks.
  • Show-O supports downstream applications like text-guided inpainting and extrapolation without fine-tuning and demonstrates potential for mixed-modality generation.
  • Show-O explores the impact of different types of representations, providing valuable insights for improving multimodal understanding in unified models.

In recent years, an increasing number of studies have focused on unified multimodal language models capable of both comprehension and generation. Some efforts use continuous representations interleaved with text tokens for autoregressive modeling to generate images. SEED-X proposes a unified and versatile foundation system capable of handling both multimodal understanding and generation tasks. In this approach, continuous image representations from the CLIP ViT encoder are combined with text tokens and fed into a large language model (LLM) to perform next-word prediction and image representation regression. Chameleon introduces a family of token-based mixed-modal models capable of both comprehending and generating images. This approach represents all modalities as discrete tokens, utilizing a unified transformer-based architecture and training the model from scratch in an end-to-end manner. In comparison, Show-O also adopts discrete tokens to represent all modalities but uses a discrete diffusion process instead of autoregressive modeling for visual generation.

SHOW-O: Methodology and Architecture

The primary objective behind the Show-O framework is to develop a unified model that integrates autoregressive and diffusion modeling for joint multimodal understanding and generation. Developing such a unified model poses significant challenges, with core issues revolving around: i) defining the model’s input/output space; ii) unifying various types of input data from different modalities; iii) integrating both autoregressive and diffusion modeling into a single transformer; and iv) effectively training such a unified model.

Show-O addresses these challenges with the following solutions:

  • Show-O constructs the input/output space by tokenizing text and image data into discrete tokens.
  • Show-O introduces its default architecture and a unified prompting strategy to structure input data and modalities.
  • Show-O demonstrates how to incorporate both autoregressive and diffusion modeling within a single transformer.
  • Show-O presents a three-stage training pipeline to effectively train the unified model.

Tokenization

Given that the proposed Show-O is built upon pre-trained LLMs, it is natural to perform unified learning in the discrete space. By maintaining a unified vocabulary that includes discrete text and image tokens, Show-O is tasked with the same learning objective: predicting discrete tokens.
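As a rough illustration of what such a unified vocabulary looks like, the snippet below appends the image codebook indices after the text vocabulary so both modalities share one prediction space. The text vocabulary size and the index layout are assumptions, not Show-O's actual configuration.

```python
TEXT_VOCAB_SIZE = 32_000          # assumed size of the pre-trained LLM tokenizer
IMAGE_CODEBOOK_SIZE = 8_192       # image codebook size used by Show-O
UNIFIED_VOCAB_SIZE = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE

def image_code_to_token_id(code: int) -> int:
    """Map a codebook index (0..8191) into the unified token space."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

def token_id_to_image_code(token_id: int) -> int:
    """Inverse mapping, valid only for ids in the image range."""
    assert TEXT_VOCAB_SIZE <= token_id < UNIFIED_VOCAB_SIZE
    return token_id - TEXT_VOCAB_SIZE
```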

Text Tokenization

Show-O relies on a pre-trained LLM, and the same tokenizer is used for text data tokenization without any modifications.

Image Tokenization

Following MAGVIT-v2, Show-O trains a lookup-free quantizer on around 35M images. The quantizer maintains a codebook of size 8,192 and encodes images of 256×256 resolution into 16×16 discrete tokens. MAGVIT-v2 is chosen for its ease of fine-tuning, making it suitable as a video tokenizer with temporal compression capability, an aspect Show-O plans to explore in the future. An alternative approach is to use different tokenizers for understanding and generation, respectively. Inspired by existing studies, Show-O also extracts continuous image representations from the pre-trained MAGVIT-v2 and CLIP-ViT encoders to explore improvements in multimodal understanding capabilities. In the following sections, the default Show-O employs discrete image tokens as input for both multimodal understanding and generation. For simplicity, the methodology sections elaborate only on the default Show-O.
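The sketch below illustrates the lookup-free quantization idea behind such a tokenizer: with a 13-dimensional latent quantized to ±1 per dimension, the implicit codebook has 2^13 = 8,192 entries, and a 256×256 image becomes a 16×16 grid of token ids. The encoder itself is omitted, and the tensor shapes are illustrative rather than taken from the actual MAGVIT-v2 implementation.

```python
import torch

def lookup_free_quantize(latents: torch.Tensor):
    """Sketch of lookup-free quantization (LFQ).

    latents: (batch, 13, 16, 16) continuous features from an image encoder.
    Each of the 13 latent dimensions is quantized to {-1, +1}, so the implicit
    codebook has 2**13 = 8,192 entries.
    """
    signs = torch.where(latents > 0,
                        torch.ones_like(latents),
                        -torch.ones_like(latents))           # (B, 13, 16, 16)

    # The token id is the integer whose binary digits are the sign pattern.
    bits = (signs > 0).long()                                # (B, 13, 16, 16)
    powers = (2 ** torch.arange(bits.shape[1])).view(1, -1, 1, 1)
    token_ids = (bits * powers).sum(dim=1)                   # (B, 16, 16), values in [0, 8191]

    return signs, token_ids

# A 256x256 image is thus represented by 16x16 = 256 discrete tokens.
```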

Architecture

Show-O inherits the architecture of existing LLMs without any architectural modifications, except for prepending a QK-Norm operation to each attention layer. Show-O is initialized with the weights of a pre-trained LLM and expands the size of the embedding layer by incorporating 8,192 new learnable embeddings for discrete image tokens. Unlike state-of-the-art diffusion models that require an additional text encoder, Show-O inherently encodes text conditional information for text-to-image generation.
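A minimal sketch of this embedding expansion is shown below, assuming a standard `nn.Embedding` table; the initialization scheme for the new image-token rows is an assumption for illustration.

```python
import torch
import torch.nn as nn

def expand_token_embeddings(embedding: nn.Embedding, num_new_tokens: int = 8192) -> nn.Embedding:
    """Append learnable rows for discrete image tokens to a pre-trained
    LLM embedding table (sketch)."""
    old_vocab, dim = embedding.weight.shape
    expanded = nn.Embedding(old_vocab + num_new_tokens, dim)

    with torch.no_grad():
        # Keep the pre-trained text embeddings unchanged.
        expanded.weight[:old_vocab] = embedding.weight
        # Randomly initialize the new image-token embeddings (one common choice).
        nn.init.normal_(expanded.weight[old_vocab:], std=0.02)

    return expanded
```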

Unified Prompting 

To perform unified learning on multimodal understanding and generation, Show-O uses a unified prompting strategy to format various types of input data. Given an image-text pair (x, y), it is first tokenized into M image tokens and N text tokens by the image and text tokenizers, respectively. The tokens are then formed into an input sequence according to the task type, as illustrated in the following figure.

By employing this prompt design, Show-O can effectively encode various input data for multimodal understanding, text-to-image generation, and mixed-modality generation as sequential data. This setup enables unified learning to operate seamlessly across sequences for these different tasks. Once trained, Show-O can be prompted to handle a wide range of vision-language tasks, including visual question answering and text-to-image generation.
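The sketch below shows how such sequences might be assembled. The special-token names and the exact orderings are illustrative assumptions based on the description above, not Show-O's verbatim prompt format.

```python
# Illustrative special tokens marking the task and the modality boundaries.
TASK_MMU, TASK_T2I = "[MMU]", "[T2I]"        # multimodal understanding / text-to-image
SOT, EOT = "[SOT]", "[EOT]"                  # start / end of text tokens
SOI, EOI = "[SOI]", "[EOI]"                  # start / end of image tokens

def format_sequence(task: str, text_tokens: list, image_tokens: list) -> list:
    """Arrange M image tokens and N text tokens into one input sequence."""
    if task == "mmu":
        # Understanding: the image comes first, the answer text follows it.
        return [TASK_MMU, SOI, *image_tokens, EOI, SOT, *text_tokens, EOT]
    if task == "t2i":
        # Generation: the text prompt conditions the (masked) image tokens.
        return [TASK_T2I, SOT, *text_tokens, EOT, SOI, *image_tokens, EOI]
    raise ValueError(f"unknown task: {task}")
```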

Omni-Attention Mechanism 

Unlike existing works that model sequences autoregressively only, Show-O introduces an omni-attention mechanism, enabling it to model various types of signals in distinct ways. This comprehensive attention mechanism adaptively switches between causal and full attention based on the format of the input sequence. The following figure illustrates examples of omni-attention for different input sequences.

Specifically, Show-O processes text tokens within the sequence via causal attention, while image tokens are handled using full attention, allowing each token to comprehensively interact with all others. In multimodal understanding, text tokens can attend to all previous image tokens, while in text-to-image generation, image tokens can interact with all preceding text tokens. Omni-attention retains the text reasoning knowledge from the pre-trained LLM and enhances the efficiency of image generation by reducing sampling steps. Moreover, it supports various downstream applications, such as inpainting and extrapolation, without requiring fine-tuning. When given only text tokens, the mechanism defaults to causal attention.
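One way to realize this behavior is to start from a causal mask and additionally allow full attention among image positions, as in the sketch below; the exact masking rules in Show-O may differ, so treat this as an approximation of the described mechanism.

```python
import torch

def omni_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Build an attention mask mixing causal and full attention (sketch).

    is_image: BoolTensor of shape (seq_len,), True where the position holds an
    image token. Returns a (seq_len, seq_len) boolean mask where entry (i, j)
    is True if position i may attend to position j.
    """
    n = is_image.shape[0]

    # Causal base: every token attends to itself and all preceding tokens,
    # so text tokens see earlier text and image tokens.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))

    # Image tokens additionally attend to every other image token (full
    # attention within the image), regardless of position.
    mask |= is_image.unsqueeze(0) & is_image.unsqueeze(1)

    return mask

# With only text tokens (is_image all False), this reduces to causal attention.
```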

SHOW-O: Experiments and Results

The following table presents the multimodal understanding capability of Show-O on public benchmarks, such as image captioning and visual question-answering tasks.

The current version of Show-O is built upon Phi-1.5, and therefore Show-O’s understanding-only counterpart, LLaVA-v1.5-Phi-1.5, serves as the direct baseline. Show-O exhibits comparable performance in all evaluation metrics to the baseline LLaVA-v1.5-Phi-1.5, which is dedicated solely to multimodal understanding. This demonstrates the great potential of the Show-O framework to unify multimodal understanding and generation within a single transformer. When compared to understanding-only models like InstructBLIP, Qwen-VL-Chat, and mPLUG-Owl2, Show-O, despite having a much smaller model size, achieves competitive performance on the POPE, MME, Flickr30k, and VQAv2 benchmarks, and performs better on the GQA benchmark. When compared to unified models with significantly more parameters, such as NExT-GPT-13B and Chameleon-34B, Show-O also achieves strong performance on the Flickr30k benchmark and performs much better on the VQAv2 benchmark.

Given these promising results, Show-O is envisioned as a potential next-generation foundation model for unifying understanding and generation. These results also show the potential of scaling Show-O to achieve state-of-the-art performance.

Qualitative Comparisons

We present qualitative comparisons with diffusion-based models such as SDv1.5 and SDXL, the autoregressive-based model LlamaGen, and unified models like LWM and SEED-X, as shown in the following figure.

Show-O demonstrates the ability to generate realistic images whose content is consistent with both short and long text prompts. Compared to SDv1.5 and LlamaGen, Show-O exhibits better visual quality and stronger image-text alignment. For example, in the second column, both SDv1.5 and LlamaGen fail to fully comprehend the text prompt and miss attributes such as “sunset” and “blue domes” in the generated images. Compared to SDXL, Show-O provides comparable visual quality and alignment, as seen in examples like “a rally car race” and “stunning contrast against the colourful sunset.”

Text-Guided Inpainting and Extrapolation 

Show-O naturally supports text-guided inpainting and extrapolation without requiring any fine-tuning. The following figure illustrates several examples.

At the top of the figure, given an input image and an inpainting mask, Show-O can transform a red trolley car into a blue sports car with sleek curves and tinted windows based on a user-provided text prompt. Show-O can also extrapolate the original image horizontally or vertically based on the given text prompt. For instance, in the second row, Show-O extrapolates an image by adding new objects, like “red wildflowers.” The pixels in both the inpainted and extrapolated regions remain consistent with the original image. These examples clearly show the inherent advantages of Show-O over autoregressive models for downstream applications.
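To make the inpainting workflow concrete, the sketch below mirrors the described procedure: tokenize the image, mask the tokens under the edited region, and let an iterative, text-conditioned mask-predict sampler fill them in. The `tokenizer` and `model` interfaces, the confidence-based schedule, and the number of steps are assumptions for illustration, not Show-O's actual API.

```python
import torch

def inpaint(image, region_mask, prompt, tokenizer, model, mask_token_id, steps=16):
    """Text-guided inpainting with discrete image tokens (sketch).

    region_mask: BoolTensor over the flattened token grid, True where the
    user wants the content regenerated.
    """
    tokens = tokenizer.encode(image).clone()        # flattened codebook indices, e.g. (256,)
    unknown = region_mask.clone()                   # positions still to be filled
    tokens[unknown] = mask_token_id

    for step in range(steps):
        logits = model(prompt, tokens)              # (seq_len, vocab), conditioned on the text
        confidence, predictions = logits.softmax(dim=-1).max(dim=-1)

        # Commit the most confident fraction of the still-unknown positions and
        # leave the rest masked for later steps (MaskGIT-style iterative decoding).
        num_unknown = int(unknown.sum())
        num_to_fill = max(1, int((step + 1) / steps * num_unknown))

        confidence = confidence.masked_fill(~unknown, -1.0)   # rank only unknown positions
        fill_idx = confidence.topk(num_to_fill).indices
        tokens[fill_idx] = predictions[fill_idx]
        unknown[fill_idx] = False

        if not unknown.any():
            break

    # Tokens outside region_mask are never touched, so the untouched parts of
    # the image remain consistent with the input.
    return tokenizer.decode(tokens)
```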

Final Thoughts

In this article, we have discussed Show-O, a unified transformer that integrates multimodal understanding and generation. Unlike fully autoregressive models, Show-O unifies autoregressive and discrete diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities, making it the first model to unify the two paradigms and handle different modalities in distinct ways. The unified model flexibly supports a wide range of vision-language tasks, including visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Extensive experimental results show that Show-O is comparable to, and in some cases better than, individual expert models with an equivalent or larger number of parameters across a wide range of vision-language tasks, highlighting its potential as a next-generation foundation model.
