VQ-Diffusion

Will Berman

Vector Quantized Diffusion (VQ-Diffusion) is a conditional latent diffusion model developed by the University of Science and Technology of China and Microsoft. Unlike most commonly studied diffusion models, VQ-Diffusion's noising and denoising processes operate on a quantized latent space, i.e., the latent space is composed of a discrete set of vectors. Discrete diffusion models are less explored than their continuous counterparts and offer an interesting point of comparison with autoregressive (AR) models.



Demo

🧨 Diffusers lets you run VQ-Diffusion with just a few lines of code.

Install dependencies

pip install 'diffusers[torch]' transformers ftfy

Load the pipeline

from diffusers import VQDiffusionPipeline

pipe = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq")

If you wish to use FP16 weights

from diffusers import VQDiffusionPipeline
import torch

pipe = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq", torch_dtype=torch.float16, revision="fp16")

Move to GPU

pipe.to("cuda")

Run the pipeline!

prompt = "A teddy bear playing within the pool."

image = pipe(prompt).images[0]
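
The pipeline returns standard PIL images, so you can save the result as usual (the filename here is just an example):

image.save("teddy_bear.png")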




Architecture

(Architecture diagram of VQ-Diffusion.)



VQ-VAE

Images are encoded into a set of discrete "tokens" or embedding vectors using a VQ-VAE encoder. To do so, images are split into patches, and each patch is then replaced by the closest entry from a codebook with a fixed-size vocabulary. This reduces the dimensionality of the input pixel space. VQ-Diffusion uses the VQGAN variant from Taming Transformers. This blog post is a good resource for better understanding VQ-VAEs.
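
As a rough sketch of the quantization step (illustrative only, not the actual VQGAN code, and with made-up dimensions), each patch embedding is replaced by the index of its nearest codebook entry:

import torch

codebook = torch.randn(1024, 256)              # learned codebook: 1024 entries of dimension 256 (random here)
patch_embeddings = torch.randn(1, 1024, 256)   # encoder output: 32x32 = 1024 patch embeddings for one image

# Nearest-neighbor lookup: distance from each patch embedding to every codebook entry
distances = torch.cdist(patch_embeddings, codebook.unsqueeze(0))  # [1, 1024, 1024]
tokens = distances.argmin(dim=-1)              # [1, 1024] discrete "tokens"

quantized = codebook[tokens]                   # [1, 1024, 256] quantized latents fed to the decoder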

VQ-Diffusion uses a pre-trained VQ-VAE which is kept frozen during the diffusion training process.



Forward process

In the forward diffusion process, each latent token can stay the same, be resampled to a different latent vector (each with equal probability), or be masked. Once a latent token is masked, it will stay masked. At step \( t \), a token is masked with probability \( \gamma_t \), resampled uniformly with probability \( K\beta_t \) (where \( K \) is the codebook size), and left unchanged with probability \( \alpha_t = 1 - K\beta_t - \gamma_t \).
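
A minimal sketch of one forward step under this mask-and-replace scheme (the probabilities below are made up; in practice \( \alpha_t \), \( \beta_t \), and \( \gamma_t \) follow a schedule over timesteps):

import torch

K = 1024                                 # codebook size; index K is the special [MASK] token
mask_token = K
gamma_t = 0.1                            # probability an unmasked token becomes masked (made up)
beta_t = 0.05 / K                        # probability of resampling to any one particular token (made up)
alpha_t = 1.0 - K * beta_t - gamma_t     # probability of staying the same

x_prev = torch.randint(0, K + 1, (1, 32 * 32))   # x_{t-1}: latent tokens, possibly already masked
is_masked = x_prev == mask_token

u = torch.rand(x_prev.shape)
resampled = torch.randint(0, K, x_prev.shape)    # uniform resampling targets

x_t = torch.where(u < gamma_t, torch.full_like(x_prev, mask_token),
      torch.where(u < gamma_t + K * beta_t, resampled, x_prev))
x_t = torch.where(is_masked, torch.full_like(x_prev, mask_token), x_t)  # masking is absorbing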



Approximating the reverse process

An encoder-decoder transformer approximates the classes of the un-noised latents, \( x_0 \), conditioned on the prompt, \( y \). The encoder is a CLIP text encoder with frozen weights. The decoder transformer provides global attention over all latent pixels and outputs the log probabilities of the categorical distribution over the codebook vectors for each latent pixel.
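
If you loaded the pipeline from the demo above, you can inspect these pieces directly. The attribute names below match the diffusers implementation at the time of writing; pipe.components lists them if they change:

print(pipe.text_encoder.__class__.__name__)   # CLIP text encoder (the frozen prompt encoder)
print(pipe.transformer.__class__.__name__)    # denoising transformer over the latent tokens
print(pipe.vqvae.__class__.__name__)          # frozen VQ-VAE used to encode/decode latents
print(pipe.scheduler.__class__.__name__)      # discrete diffusion scheduler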

The AR models section provides additional context on VQ-Diffusion’s architecture compared to AR transformer based models.

Taming Transformers provides a good discussion on converting raw pixels to discrete tokens in a compressed latent space so that transformers become computationally feasible for image data.



VQ-Diffusion in Context



Diffusion Models

Contemporary diffusion models are mostly continuous. In the forward process, continuous diffusion models iteratively add Gaussian noise. The reverse process is approximated via \( p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \), where neural networks are trained to predict the mean \( \mu_\theta \) and covariance \( \Sigma_\theta \) of the Gaussian at each step.
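
For comparison, a single continuous reverse step just samples from that Gaussian once the network has produced \( \mu_\theta \) and \( \Sigma_\theta \) (toy tensors stand in for the network outputs here):

import torch

x_t = torch.randn(1, 3, 64, 64)          # current noisy sample
mu = 0.9 * x_t                           # stand-in for mu_theta(x_t, t)
var = 0.1 * torch.ones_like(x_t)         # stand-in for a diagonal Sigma_theta(x_t, t)

x_prev = mu + var.sqrt() * torch.randn_like(x_t)   # x_{t-1} ~ N(mu, Sigma)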

The approximate reverse process is structurally similar to the discrete reverse process. However, in the discrete case there is no clear analog for predicting the noise in \( x_t \), so VQ-Diffusion instead directly predicts the distribution over the un-noised latents, \( x_0 \).

There is less literature covering discrete diffusion models than continuous diffusion models. Deep Unsupervised Learning using Nonequilibrium Thermodynamics introduces a diffusion model over a binomial distribution. Argmax Flows and Multinomial Diffusion extends discrete diffusion to multinomial distributions and trains a transformer to predict the unnoised distribution for a language modeling task. Structured Denoising Diffusion Models in Discrete State-Spaces generalizes multinomial diffusion with alternative noising processes: uniform, absorbing, discretized Gaussian, and token embedding distance. Alternative noising processes are also possible in continuous diffusion models, but as noted in the paper, only additive Gaussian noise has received significant attention.



Autoregressive Models

It is perhaps more interesting to compare VQ-Diffusion to AR models, as they more frequently feature transformers making predictions over discrete distributions. While transformers have demonstrated success in AR modeling, they still suffer from linear decreases in inference speed as image resolution increases, error accumulation, and directional bias. VQ-Diffusion improves on all three pain points.

AR image generative models are characterized by factoring the image probability such that each pixel is conditioned on the previous pixels in a raster scan order (left to right, top to bottom), i.e. \( p(x) = \prod_i p(x_i | x_{i-1}, x_{i-2}, \ldots, x_2, x_1) \).
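
Concretely, this factorization forces sampling to proceed one latent pixel at a time, since each conditional depends on everything generated so far. A toy sketch of the loop, with a placeholder standing in for an AR transformer:

import torch

K, num_latents = 1024, 32 * 32

def ar_model(prefix):
    # Placeholder for an AR transformer: logits over the K codebook entries
    # for the next latent pixel, given the prefix in raster-scan order.
    return torch.randn(K)

tokens = []
for i in range(num_latents):             # one forward pass per latent pixel
    logits = ar_model(torch.tensor(tokens))
    next_token = torch.distributions.Categorical(logits=logits).sample()
    tokens.append(next_token.item())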

AR image generative models have evolved architecturally, with much work towards making transformers computationally feasible. Prior to transformer based models, PixelRNN, PixelCNN, and PixelCNN++ were the state of the art.

Image Transformer provides a good discussion on the non-transformer based models and the transition to transformer based models (see the paper for omitted citations).

Training recurrent neural networks to sequentially predict each pixel of even a small image is computationally very challenging. Thus, parallelizable models that use convolutional neural networks such as the PixelCNN have recently received much more attention, and have now surpassed the PixelRNN in quality.

One disadvantage of CNNs compared to RNNs is their typically fairly limited receptive field. This can adversely affect their ability to model long-range phenomena common in images, such as symmetry and occlusion, especially with a small number of layers. Growing the receptive field has been shown to improve quality significantly (Salimans et al.). Doing so, however, comes at a significant cost in number of parameters and consequently computational performance and can make training such models more challenging.

… self-attention can achieve a better balance in the trade-off between the virtually unlimited receptive field of the necessarily sequential PixelRNN and the limited receptive field of the much more parallelizable PixelCNN and its various extensions.

Image Transformer uses transformers by restricting self-attention to local neighborhoods of pixels.

Taming Transformers and DALL-E 1 combine convolutions and transformers. Both train a VQ-VAE to learn a discrete latent space, and then a transformer is trained in the compressed latent space. The transformer context is global but masked: attention is provided over all previously predicted latent pixels, but the model is still AR so attention cannot be provided over not-yet-predicted pixels.
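
The "global but masked" context amounts to an ordinary causal attention mask over the flattened latent pixels; a minimal sketch:

import torch

num_latents = 32 * 32
# Position i may attend to positions 0..i (already predicted), never to j > i.
causal_mask = torch.ones(num_latents, num_latents).tril().bool()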

ImageBART combines convolutions, transformers, and diffusion processes. It learns a discrete latent space that is further compressed with a short multinomial diffusion process. Separate encoder-decoder transformers are then trained to reverse each step in the diffusion process. The encoder transformer provides global context on \( x_t \) while the decoder transformer autoregressively predicts \( x_{t-1} \).

Despite having made tremendous strides, AR models still suffer from linear decreases in inference speed as image resolution increases, error accumulation, and directional bias. For equivalently sized AR transformer models, the big-O of VQ-Diffusion's inference is better so long as the number of diffusion steps is less than the number of latent pixels. For the ITHQ dataset, the latent resolution is 32x32 and the model is trained with up to 100 diffusion steps, an ~10x big-O improvement. In practice, VQ-Diffusion "can be 15 times faster than AR methods while achieving a better image quality" (see the paper for more details). Additionally, VQ-Diffusion does not require teacher-forcing and instead learns to correct incorrectly predicted tokens: during training, noised images are both masked and have latent pixels replaced with random tokens. VQ-Diffusion is also able to provide global context on \( x_t \) while predicting \( x_{t-1} \).
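
The ~10x big-O figure above comes straight from the ITHQ numbers:

latent_pixels = 32 * 32      # 1024 sequential AR forward passes, one per latent pixel
diffusion_steps = 100        # VQ-Diffusion forward passes, one per diffusion step
print(latent_pixels / diffusion_steps)   # ~10x fewer forward passes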



Further steps with VQ-Diffusion and 🧨 Diffusers

So far, we have only ported the VQ-Diffusion model trained on the ITHQ dataset. There are also released VQ-Diffusion models trained on CUB-200, Oxford-102, MSCOCO, Conceptual Captions, LAION-400M, and ImageNet.

VQ-Diffusion also supports a faster inference strategy. The network reparameterization relies on the posterior of the diffusion process conditioned on the un-noised image being tractable. The same formula applies when using a time stride, \( \Delta t \), that skips multiple reverse diffusion steps:

\( p_\theta(x_{t - \Delta t} | x_t, y) = \sum_{\tilde{x}_0 = 1}^{K} q(x_{t - \Delta t} | x_t, \tilde{x}_0) \, p_\theta(\tilde{x}_0 | x_t, y) \)
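
A minimal numerical sketch of that marginalization for a single latent position (toy distributions; the real pipeline applies this per latent pixel, with \( q \) derived from the noise schedule):

import torch

K = 1024

p_x0 = torch.randn(K).softmax(dim=-1)    # p_theta(x_0 = k | x_t, y), from the transformer
q = torch.randn(K, K).softmax(dim=-1)    # q(x_{t - dt} = j | x_t, x_0 = k), row k sums to 1

p_prev = q.T @ p_x0                      # p_theta(x_{t - dt} = j | x_t, y), marginalizing over x_0
print(p_prev.sum())                      # ~1.0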

Improved Vector Quantized Diffusion Models improves upon VQ-Diffusion's sample quality with discrete classifier-free guidance and an alternative inference strategy to address the "joint distribution issue" (see section 3.2 of the paper for more details). Discrete classifier-free guidance has been merged into diffusers, but the alternative inference strategy has not been added yet.

Contributions are welcome!


