Vector Quantized Diffusion (VQ-Diffusion) is a conditional latent diffusion model developed by the University of Science and Technology of China and Microsoft. Unlike most commonly studied diffusion models, VQ-Diffusion’s noising and denoising processes operate on a quantized latent space, i.e., the latent space consists of a discrete set of vectors. Discrete diffusion models are less explored than their continuous counterparts and offer an interesting point of comparison with autoregressive (AR) models.
Demo
🧨 Diffusers lets you run VQ-Diffusion with just a few lines of code.
Install dependencies
pip install 'diffusers[torch]' transformers ftfy
Load the pipeline
from diffusers import VQDiffusionPipeline
pipe = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq")
If you wish to use FP16 weights
from diffusers import VQDiffusionPipeline
import torch
pipe = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq", torch_dtype=torch.float16, revision="fp16")
Move to GPU
pipe.to("cuda")
Run the pipeline!
prompt = "A teddy bear playing in the pool."
image = pipe(prompt).images[0]
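The pipeline returns standard PIL images in .images, so the result can be saved directly (the filename below is just an example):
# Save the generated image; "teddy_bear.png" is an arbitrary example filename
image.save("teddy_bear.png")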
Architecture
VQ-VAE
Images are encoded into a set of discrete “tokens” or embedding vectors using a VQ-VAE encoder. To do so, images are split into patches, and then each patch is replaced by the closest entry from a codebook with a fixed-size vocabulary. This reduces the dimensionality of the input pixel space. VQ-Diffusion uses the VQGAN variant from Taming Transformers. This blog post is a good resource for better understanding VQ-VAEs.
VQ-Diffusion uses a pre-trained VQ-VAE which was frozen during the diffusion training process.
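As a rough illustration of the quantization step (a minimal sketch, not the actual VQGAN code; the codebook size and embedding dimension below are made up), each continuous patch embedding produced by the encoder is replaced by its nearest codebook entry:
import torch
# Hypothetical sizes, chosen only for illustration
codebook_size, embed_dim = 4096, 256
codebook = torch.randn(codebook_size, embed_dim)    # the fixed-size vocabulary of latent vectors
# Continuous encoder outputs for a 32x32 grid of latent "pixels"
encoder_output = torch.randn(32 * 32, embed_dim)
# Quantize: map each latent pixel to the index of its nearest codebook entry
distances = torch.cdist(encoder_output, codebook)   # (1024, codebook_size)
tokens = distances.argmin(dim=-1)                   # (1024,) discrete token ids
# The diffusion process operates on these discrete token ids rather than on raw pixels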
Forward process
In the forward diffusion process, each latent token can stay the same, be resampled to a different latent vector (each with equal probability), or be masked. Once a latent token is masked, it will stay masked. $\alpha_t$, $\beta_t$, and $\gamma_t$ are hyperparameters that control the forward diffusion process from step $t-1$ to step $t$. $\gamma_t$ is the probability an unmasked token becomes masked. $\alpha_t + \beta_t$ is the probability an unmasked token stays the same. The token can transition to any individual non-masked latent vector with a probability of $\beta_t$. In other words, $\alpha_t + K\beta_t + \gamma_t = 1$, where $K$ is the number of non-masked latent vectors. See section 4.1 of the paper for more details.
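To make the transition probabilities concrete, here is a minimal sketch of a single forward step applied to a sequence of discrete latent tokens, assuming illustrative hyperparameter values and a reserved [MASK] token id equal to $K$:
import torch
K = 4096                              # number of non-masked latent vectors (illustrative)
MASK = K                              # reserve one extra id for the [MASK] token
alpha_t, gamma_t = 0.9, 0.05          # illustrative hyperparameters for this step
beta_t = (1 - alpha_t - gamma_t) / K  # so that alpha_t + K * beta_t + gamma_t = 1
def forward_step(tokens):
    # tokens: (num_latent_pixels,) ids in [0, K]; id K means the token is already masked
    out = tokens.clone()
    u = torch.rand(tokens.shape)
    not_masked = tokens != MASK
    to_mask = not_masked & (u < gamma_t)                                    # prob gamma_t: becomes masked
    to_resample = not_masked & (u >= gamma_t) & (u < gamma_t + K * beta_t)  # prob K * beta_t: resampled uniformly
    out[to_mask] = MASK
    out[to_resample] = torch.randint(0, K, (int(to_resample.sum()),))
    # Remaining probability alpha_t: the token is left unchanged. Since uniform resampling
    # can land on the original id, the total stay-the-same probability is alpha_t + beta_t.
    return out
tokens = torch.randint(0, K, (32 * 32,))  # a fully unmasked latent image
noisier = forward_step(tokens)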
Approximating the reverse process
An encoder-decoder transformer approximates the classes of the un-noised latents, $x_0$, conditioned on the prompt, $y$. The encoder is a CLIP text encoder with frozen weights. The decoder transformer provides unmasked global attention to all latent pixels and outputs the log probabilities of the categorical distribution over vector embeddings. The decoder transformer predicts the entire distribution of un-noised latents in one forward pass, providing global self-attention over $x_t$. Framing the problem as conditional sequence-to-sequence over discrete values provides some intuition for why the encoder-decoder transformer is a good fit.
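In terms of interfaces, the denoising network maps the noised token ids $x_t$ and the text embedding $y$ to per-position log probabilities over the codebook in a single pass. Below is a minimal sketch of that interface with a random stand-in for the transformer (the names and shapes are illustrative, not the actual diffusers modules):
import torch
num_latent_pixels, K = 32 * 32, 4096      # illustrative sizes
def denoising_transformer(x_t, y, t):
    # Stand-in for the encoder-decoder transformer: per-position log probabilities
    # of the categorical distribution over the K codebook entries for x_0
    logits = torch.randn(x_t.shape[0], x_t.shape[1], K)
    return torch.log_softmax(logits, dim=-1)
x_t = torch.randint(0, K + 1, (1, num_latent_pixels))  # id K plays the role of [MASK]
y = torch.randn(1, 77, 512)                            # hypothetical CLIP text-encoder output shape
log_p_x0 = denoising_transformer(x_t, y, t=50)         # (1, 1024, K): the full distribution, no AR loop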
The AR models section provides additional context on VQ-Diffusion’s architecture compared to AR transformer based models.
Taming Transformers provides a good discussion on converting raw pixels to discrete tokens in a compressed latent space so that transformers become computationally feasible for image data.
VQ-Diffusion in Context
Diffusion Models
Contemporary diffusion models are mostly continuous. In the forward process, continuous diffusion models iteratively add Gaussian noise. The reverse process is approximated via $p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$. In the simpler case of DDPM, the covariance matrix is fixed, a U-Net is trained to predict the noise in $x_t$, and $x_{t-1}$ is derived from the noise.
The approximate reverse process is structurally similar to the discrete reverse process. However, in the discrete case, there is no clear analog for predicting the noise in $x_t$, and directly predicting the distribution for $x_0$ is a clearer objective.
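For reference, here is a simplified sketch of the continuous DDPM-style reverse step described above, with a dummy stand-in for the U-Net and the usual DDPM coefficient names (illustrative only, not the diffusers implementation):
import torch
def ddpm_reverse_step(x_t, t, unet, alphas, alphas_cumprod, betas):
    eps = unet(x_t, t)                                    # predicted noise in x_t
    coef = betas[t] / torch.sqrt(1 - alphas_cumprod[t])
    mean = (x_t - coef * eps) / torch.sqrt(alphas[t])     # mean of p(x_{t-1} | x_t)
    sigma = torch.sqrt(betas[t])                          # fixed covariance
    return mean + sigma * torch.randn_like(x_t)           # sample x_{t-1}
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
x_t = torch.randn(1, 3, 64, 64)
x_prev = ddpm_reverse_step(x_t, T - 1, lambda x, t: torch.zeros_like(x), alphas, alphas_cumprod, betas)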
There is a smaller amount of literature covering discrete diffusion models than continuous diffusion models. Deep Unsupervised Learning using Nonequilibrium Thermodynamics introduces a diffusion model over a binomial distribution. Argmax Flows and Multinomial Diffusion extends discrete diffusion to multinomial distributions and trains a transformer for predicting the unnoised distribution for a language modeling task. Structured Denoising Diffusion Models in Discrete State-Spaces generalizes multinomial diffusion with alternative noising processes — uniform, absorbing, discretized Gaussian, and token embedding distance. Alternative noising processes are also possible in continuous diffusion models, but as noted in the paper, only additive Gaussian noise has received significant attention.
Autoregressive Models
It is perhaps more interesting to compare VQ-Diffusion to AR models as they more frequently feature transformers making predictions over discrete distributions. While transformers have demonstrated success in AR modeling, they still suffer from linear decreases in inference speed for increased image resolution, error accumulation, and directional bias. VQ-Diffusion improves on all three pain points.
AR image generative models are characterized by factoring the image probability such that each pixel is conditioned on the previous pixels in a raster scan order (left to right, top to bottom), i.e. $p(x) = \prod_i p(x_i | x_{i-1}, x_{i-2}, \ldots, x_2, x_1)$. As a result, the models can be trained by directly maximizing the log-likelihood. Additionally, AR models which operate on actual pixel (non-latent) values predict channel values from a discrete multinomial distribution, i.e. first the red channel value is sampled from a 256-way softmax, and then the green channel prediction is conditioned on the red channel value.
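A minimal sketch of how that factorization plays out at sampling time for an AR model over discrete latent tokens (the stand-in model and sizes are illustrative); note the sequential loop, which is exactly what VQ-Diffusion avoids:
import torch
num_latent_pixels, K = 32 * 32, 4096   # illustrative sizes
def ar_transformer(tokens_so_far):
    # Stand-in for an AR transformer: logits for the next token given the previous tokens
    return torch.randn(K)
tokens = []
for i in range(num_latent_pixels):
    logits = ar_transformer(torch.tensor(tokens))       # attends only to already-predicted tokens
    probs = torch.softmax(logits, dim=-1)
    tokens.append(torch.multinomial(probs, 1).item())   # sample pixel i in raster-scan order
# One forward pass per latent pixel, so inference time grows linearly with the latent resolution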
AR image generative models have evolved architecturally with much work towards making transformers computationally feasible. Prior to transformer based models, PixelRNN, PixelCNN, and PixelCNN++ were the state of the art.
Image Transformer provides a good discussion on the non-transformer based models and the transition to transformer based models (see paper for omitted citations).
Training recurrent neural networks to sequentially predict each pixel of even a small image is computationally very challenging. Thus, parallelizable models that use convolutional neural networks such as the PixelCNN have recently received much more attention, and have now surpassed the PixelRNN in quality.
One disadvantage of CNNs compared to RNNs is their typically fairly limited receptive field. This can adversely affect their ability to model long-range phenomena common in images, such as symmetry and occlusion, especially with a small number of layers. Growing the receptive field has been shown to improve quality significantly (Salimans et al.). Doing so, however, comes at a significant cost in number of parameters and consequently computational performance and can make training such models more challenging.
… self-attention can achieve a better balance in the trade-off between the virtually unlimited receptive field of the necessarily sequential PixelRNN and the limited receptive field of the much more parallelizable PixelCNN and its various extensions.
Image Transformer uses transformers by restricting self attention over local neighborhoods of pixels.
Taming Transformers and DALL-E 1 combine convolutions and transformers. Each trains a VQ-VAE to learn a discrete latent space, and then a transformer is trained in the compressed latent space. The transformer context is global but masked, because attention is provided over all previously predicted latent pixels, but the model is still AR so attention cannot be provided over not-yet-predicted latent pixels.
ImageBART combines convolutions, transformers, and diffusion processes. It learns a discrete latent space that is further compressed with a short multinomial diffusion process. Separate encoder-decoder transformers are then trained to reverse each step in the diffusion process. The encoder transformer provides global context on $x_t$ while the decoder transformer autoregressively predicts latent pixels in $x_{t-1}$. As a result, each pixel receives global cross attention on the more noised image. Between 2 and 5 diffusion steps are used, with more steps for more complex datasets.
Despite having made tremendous strides, AR models still suffer from linear decreases in inference speed for increased image resolution, error accumulation, and directional bias. For equivalently sized AR transformer models, the big-O of VQ-Diffusion’s inference is better so long as the number of diffusion steps is less than the number of latent pixels. For the ITHQ dataset, the latent resolution is 32×32 and the model is trained with up to 100 diffusion steps, which is an ~10x big-O improvement. In practice, VQ-Diffusion “can be 15 times faster than AR methods while achieving a better image quality” (see paper for more details). Additionally, VQ-Diffusion does not require teacher-forcing and instead learns to correct incorrectly predicted tokens. During training, noised images are both masked and have latent pixels replaced with random tokens. VQ-Diffusion is also able to provide global context on $x_t$ while predicting $x_{t-1}$.
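To make the big-O comparison concrete, with a 32×32 latent resolution an AR model needs one forward pass per latent pixel, while VQ-Diffusion needs one per diffusion step:
latent_pixels = 32 * 32                  # forward passes for an AR transformer
diffusion_steps = 100                    # forward passes for VQ-Diffusion
print(latent_pixels / diffusion_steps)   # 10.24, i.e. roughly a 10x reduction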
Further steps with VQ-Diffusion and 🧨 Diffusers
Thus far, we have only ported the VQ-Diffusion model trained on the ITHQ dataset. There are also released VQ-Diffusion models trained on CUB-200, Oxford-102, MSCOCO, Conceptual Captions, LAION-400M, and ImageNet.
VQ-Diffusion also supports a faster inference strategy. The network reparameterization relies on the posterior of the diffusion process conditioned on the un-noised image being tractable. A similar formula applies when using a time stride, $\Delta t$, that skips a number of reverse diffusion steps: $p_\theta(x_{t-\Delta t} | x_t, y) = \sum_{\tilde{x}_0} q(x_{t-\Delta t} | x_t, \tilde{x}_0)\, p_\theta(\tilde{x}_0 | x_t, y)$.
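The sketch below spells out that marginalization over the tractable posterior with made-up tensors and tiny illustrative sizes; it is a conceptual sketch of the formula, not the diffusers implementation:
import torch
# p_x0:        (num_latent_pixels, K)      predicted p(x_0 | x_t, y) from the transformer
# q_posterior: (num_latent_pixels, K, K+1) tractable q(x_{t - delta_t} | x_t, x_0) for each
#              possible value of x_0 (the extra class is the [MASK] token)
def strided_reverse_step(p_x0, q_posterior):
    p_prev = torch.einsum("nk,nkj->nj", p_x0, q_posterior)   # marginalize over x_0
    return torch.multinomial(p_prev, 1).squeeze(-1)          # sample x_{t - delta_t}
num_latent_pixels, K = 16, 32                                # tiny sizes, illustration only
p_x0 = torch.softmax(torch.randn(num_latent_pixels, K), dim=-1)
q_post = torch.softmax(torch.randn(num_latent_pixels, K, K + 1), dim=-1)
x_prev = strided_reverse_step(p_x0, q_post)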
Improved Vector Quantized Diffusion Models improves upon VQ-Diffusion’s sample quality with discrete classifier-free guidance and an alternative inference strategy to address the “joint distribution issue” — see section 3.2 for more details. Discrete classifier-free guidance is merged into diffusers, but the alternative inference strategy has not been added yet.
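As a usage note, here is a minimal sketch assuming pipe from the demo above and assuming the pipeline exposes the guidance strength through a guidance_scale argument (check the pipeline documentation for the exact argument name and default):
# guidance_scale is an assumed argument name; consult the VQDiffusionPipeline docs
image = pipe(prompt, guidance_scale=5.0).images[0]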
Contributions are welcome!

