Over the past few months, we have seen an emergence of Transformer-based diffusion backbones for high-resolution text-to-image (T2I) generation. These models use the transformer architecture as the building block for the diffusion process, instead of the UNet architecture that was prevalent in most of the initial diffusion models. Thanks to the nature of Transformers, these backbones show good scalability, with models ranging from 0.6B to 8B parameters.
As models become larger, memory requirements increase. The problem intensifies because a diffusion pipeline usually consists of several components: a text encoder, a diffusion backbone, and an image decoder. Furthermore, modern diffusion pipelines use multiple text encoders; for example, Stable Diffusion 3 uses three. It takes 18.765 GB of GPU memory to run SD3 inference using FP16 precision.
These high memory requirements can make it difficult to use these models with consumer GPUs, slowing adoption and making experimentation harder. In this post, we show how to improve the memory efficiency of Transformer-based diffusion pipelines by leveraging Quanto's quantization utilities with the Diffusers library.
Preliminaries
For a detailed introduction to Quanto, please refer to this post. In short, Quanto is a quantization toolkit built on PyTorch. It is part of Hugging Face Optimum, a set of tools for hardware optimization.
Model quantization is a popular tool among LLM practitioners, but much less so with diffusion models. Quanto can help bridge this gap and provide memory savings with little or no quality degradation.
For benchmarking purposes, we use an H100 GPU. Unless otherwise specified, we default to performing computations in FP16. We chose not to quantize the VAE to prevent numerical instability issues. Our benchmarking code can be found here.
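While the actual benchmarking script is the one linked above, a minimal sketch of how latency and peak memory can be measured is shown below (the prompt and warm-up settings here are illustrative, not the exact benchmark configuration):

```python
import time

import torch
from diffusers import PixArtSigmaPipeline

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Warm up once so one-time initialization does not skew the timing.
_ = pipeline("a warm-up prompt", num_inference_steps=2)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
_ = pipeline("ghibli style, a fantasy landscape with castles").images[0]
torch.cuda.synchronize()
latency = time.perf_counter() - start

peak_memory_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Latency: {latency:.3f} s, peak memory: {peak_memory_gb:.3f} GB")
```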
At the time of this writing, Diffusers ships several Transformer-based diffusion pipelines for text-to-image generation.
We also have Latte, a Transformer-based text-to-video generation pipeline.
For brevity, we keep our study limited to the following three: PixArt-Sigma (a roughly 0.6B-parameter diffusion backbone), Stable Diffusion 3 (roughly 2B), and Aura Flow (roughly 6.8B).
It is worth keeping in mind that this post primarily focuses on memory efficiency, at a slight or negligible cost to inference latency.
Quantizing a DiffusionPipeline with Quanto
Quantizing a model with Quanto is simple.
```python
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import PixArtSigmaPipeline
import torch

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

quantize(pipeline.transformer, weights=qfloat8)
freeze(pipeline.transformer)
```
We call quantize() on the module to be quantized, specifying what we want to quantize. In the above case, we are only quantizing the parameters and leaving the activations as is, and we quantize to the FP8 data type. We finally call freeze() to replace the original parameters with the quantized parameters.
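For completeness, quantize() also accepts an activations argument if you want to quantize activations as well; we did not do this for the benchmarks in this post, and activation quantization typically requires a calibration pass. A sketch, continuing from the pipeline loaded above, might look like this:

```python
from optimum.quanto import Calibration, freeze, qfloat8, quantize

# Quantize both weights and activations (not what we do in this post).
quantize(pipeline.transformer, weights=qfloat8, activations=qfloat8)

# Activation quantization needs representative data to set the scales.
with Calibration():
    _ = pipeline("a calibration prompt", num_inference_steps=2)

freeze(pipeline.transformer)
```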
We can then call this pipeline normally:
```python
image = pipeline("ghibli style, a fantasy landscape with castles").images[0]
```
*(Image comparison: output with the pipeline in FP16 vs. with the diffusion transformer in FP8.)*
We notice the following memory savings when using FP8, with slightly higher latency and almost no quality degradation:
| Batch Size | Quantization | Memory (GB) | Latency (Seconds) |
|---|---|---|---|
| 1 | None | 12.086 | 1.200 |
| 1 | FP8 | 11.547 | 1.540 |
| 4 | None | 12.087 | 4.482 |
| 4 | FP8 | 11.548 | 5.109 |
We can quantize the text encoder in the same way:
```python
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)
```
The text encoder is also a transformer model, and we can quantize it too. Quantizing both the text encoder and the diffusion backbone leads to much larger memory improvements:
| Batch Size | Quantization | Quantize TE | Memory (GB) | Latency (Seconds) |
|---|---|---|---|---|
| 1 | FP8 | False | 11.547 | 1.540 |
| 1 | FP8 | True | 5.363 | 1.601 |
| 4 | FP8 | False | 11.548 | 5.109 |
| 4 | FP8 | True | 5.364 | 5.141 |
Quantizing the text encoder produces results very similar to the previous case.
Generality of the observations
Quantizing the text encoder along with the diffusion backbone generally works for the models we tried. Stable Diffusion 3 is a special case, as it uses three different text encoders. We found that quantizing the second text encoder does not work well, so we recommend the following alternatives:

- Quantize only the first text encoder,
- Quantize only the third text encoder, or
- Quantize the first and third text encoders together.
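As a sketch, the last of these combinations (quantizing text encoders 1 and 3 while leaving the second untouched) could look like this, assuming the usual SD3 pipeline attribute names:

```python
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import StableDiffusion3Pipeline
import torch

pipeline = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

# Quantize the diffusion backbone plus text encoders 1 and 3,
# leaving text_encoder_2 in FP16.
for module in [pipeline.transformer, pipeline.text_encoder, pipeline.text_encoder_3]:
    quantize(module, weights=qfloat8)
    freeze(module)
```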
The table below gives an idea of the expected memory savings for different text encoder quantization combinations (the diffusion transformer is quantized in all cases; the recommended combinations are marked with ✅):
| Batch Size | Quantization | Quantize TE 1 | Quantize TE 2 | Quantize TE 3 | Memory (GB) | Latency (Seconds) |
|---|---|---|---|---|---|---|
| 1 | FP8 | 1 | 1 | 1 | 8.200 | 2.858 |
| 1 ✅ | FP8 | 0 | 0 | 1 | 8.294 | 2.781 |
| 1 | FP8 | 1 | 1 | 0 | 14.384 | 2.833 |
| 1 | FP8 | 0 | 1 | 0 | 14.475 | 2.818 |
| 1 ✅ | FP8 | 1 | 0 | 0 | 14.384 | 2.730 |
| 1 | FP8 | 0 | 1 | 1 | 8.325 | 2.875 |
| 1 ✅ | FP8 | 1 | 0 | 1 | 8.204 | 2.789 |
| 1 | None | – | – | – | 16.403 | 2.118 |
*(Image comparison: quantized text encoder 1 vs. quantized text encoder 3 vs. quantized text encoders 1 and 3.)*
Misc findings
bfloat16 is likely to be better on H100
Using bfloat16 can be faster on supported GPU architectures, such as an H100 or RTX 4090. The table below presents some numbers for PixArt measured on our H100 reference hardware:
| Batch Size | Precision | Quantization | Memory (GB) | Latency (Seconds) | Quantize TE |
|---|---|---|---|---|---|
| 1 | FP16 | INT8 | 5.363 | 1.538 | True |
| 1 | BF16 | INT8 | 5.364 | 1.454 | True |
| 1 | FP16 | FP8 | 5.363 | 1.601 | True |
| 1 | BF16 | FP8 | 5.363 | 1.495 | True |
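Switching to bfloat16 only requires changing the dtype the pipeline is loaded with; here is a minimal sketch for the FP8 row above:

```python
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import PixArtSigmaPipeline
import torch

# Load the pipeline in BF16 instead of FP16.
pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.bfloat16
).to("cuda")

quantize(pipeline.transformer, weights=qfloat8)
freeze(pipeline.transformer)
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)
```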
The promise of qint8
We found that quantizing with qint8 (instead of qfloat8) is generally better in terms of inference latency. This effect gets more pronounced when we horizontally fuse the attention QKV projections (by calling fuse_qkv_projections() in Diffusers), thereby thickening the dimensions of the int8 kernels and speeding up computation. We present some evidence below for PixArt:
| Batch Size | Quantization | Memory (GB) | Latency (Seconds) | Quantize TE | QKV Projection |
|---|---|---|---|---|---|
| 1 | INT8 | 5.363 | 1.538 | True | False |
| 1 | INT8 | 5.536 | 1.504 | True | True |
| 4 | INT8 | 5.365 | 5.129 | True | False |
| 4 | INT8 | 5.538 | 4.989 | True | True |
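For reference, the qint8 + fused-QKV configuration above can be reproduced with something like the following sketch, assuming the PixArt transformer exposes fuse_qkv_projections() (as attention-based models in Diffusers generally do):

```python
from optimum.quanto import freeze, qint8, quantize
from diffusers import PixArtSigmaPipeline
import torch

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Fuse the attention QKV projections before quantizing so that the
# resulting int8 matmuls operate on larger, more efficient shapes.
pipeline.transformer.fuse_qkv_projections()

quantize(pipeline.transformer, weights=qint8)
freeze(pipeline.transformer)
quantize(pipeline.text_encoder, weights=qint8)
freeze(pipeline.text_encoder)
```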
How about INT4?
We additionally experimented with qint4 when using bfloat16. This is only applicable to bfloat16 on H100 because other configurations are not supported yet. With qint4, we can expect further improvements in memory consumption at the cost of increased inference latency. The increase in latency is expected, because there is no native hardware support for int4 computation; the weights are transferred using 4 bits, but the computation is still done in bfloat16. The table below shows our results for PixArt-Sigma:
| Batch Size | Quantize TE | Memory (GB) | Latency (Seconds) |
|---|---|---|---|
| 1 | No | 9.380 | 7.431 |
| 1 | Yes | 3.058 | 7.604 |
Note, however, that due to the aggressive discretization of INT4, the end results can take a hit. This is why, for Transformer-based models in general, we usually leave the final projection layer out of quantization. In Quanto, we do this as follows:
```python
from optimum.quanto import freeze, qint4, quantize

quantize(pipeline.transformer, weights=qint4, exclude="proj_out")
freeze(pipeline.transformer)
```
"proj_out" corresponds to the ultimate layer in pipeline.transformer. The table below presents results for various settings:
*(Image comparison across four settings: Quantize TE: No / Layer exclusion: None; Quantize TE: No / Layer exclusion: "proj_out"; Quantize TE: Yes / Layer exclusion: None; Quantize TE: Yes / Layer exclusion: "proj_out".)*
To recover the lost image quality, a common practice is to perform quantization-aware training, which is also supported in Quanto. This technique is out of the scope of this post; feel free to contact us if you're interested!
All the results of our experiments for this post can be found here.
Bonus – saving and loading Diffusers models in Quanto
Quantized Diffusers models can be saved and loaded:
```python
from diffusers import PixArtTransformer2DModel
from optimum.quanto import QuantizedPixArtTransformer2DModel, qfloat8

model = PixArtTransformer2DModel.from_pretrained("PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="transformer")
qmodel = QuantizedPixArtTransformer2DModel.quantize(model, weights=qfloat8)
qmodel.save_pretrained("pixart-sigma-fp8")
```
The resulting checkpoint is 587 MB in size, instead of the original 2.44 GB. We can then load it:
```python
from optimum.quanto import QuantizedPixArtTransformer2DModel
import torch

transformer = QuantizedPixArtTransformer2DModel.from_pretrained("pixart-sigma-fp8")
transformer.to(device="cuda", dtype=torch.float16)
```
And use it in a DiffusionPipeline:
```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    transformer=None,
    torch_dtype=torch.float16,
).to("cuda")
pipe.transformer = transformer

prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]
```
In the future, we can expect to pass the transformer directly when initializing the pipeline, so that this will work:
```diff
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
-    transformer=None,
+    transformer=transformer,
    torch_dtype=torch.float16,
).to("cuda")
```
The QuantizedPixArtTransformer2DModel implementation is available here for reference. If you want more models from Diffusers to be supported in Quanto for saving and loading, please open an issue here and mention @sayakpaul.
Suggestions
- Based on your requirements, you may want to apply different types of quantization to different pipeline modules. For example, you could use FP8 for the text encoder but INT8 for the diffusion transformer. Thanks to the flexibility of Diffusers and Quanto, this can be done seamlessly.
- To optimize for your use cases, you can even combine quantization with other memory optimization techniques in Diffusers, such as enable_model_cpu_offload(). A combined sketch of both ideas follows below.
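As an illustration of these tips, here is a minimal sketch that mixes FP8 for the text encoder with INT8 for the transformer and enables model CPU offloading:

```python
from optimum.quanto import freeze, qfloat8, qint8, quantize
from diffusers import PixArtSigmaPipeline
import torch

# Do not move the pipeline to "cuda" manually; offloading handles device placement.
pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
)

# Mix quantization types across modules: FP8 text encoder, INT8 transformer.
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)
quantize(pipeline.transformer, weights=qint8)
freeze(pipeline.transformer)

# Combine with other Diffusers memory optimizations.
pipeline.enable_model_cpu_offload()

image = pipeline("ghibli style, a fantasy landscape with castles").images[0]
```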
Conclusion
In this post, we showed how to quantize Transformer models from Diffusers and optimize their memory consumption. The effects of quantization become more visible when we additionally quantize the text encoders involved in the mix. We hope you will apply some of these workflows to your projects and benefit from them 🤗.
Thanks to Pedro Cuenca for his extensive reviews on the post.










