Over the past few months, we have seen an emergence of Transformer-based diffusion backbones for high-resolution text-to-image (T2I) generation. These models use the transformer architecture as the building block for the diffusion process, instead of the UNet architecture that was prevalent in most of the initial diffusion models. Thanks to the nature of Transformers, these backbones show good scalability, with models ranging from 0.6B to 8B parameters.
As models become larger, memory requirements increase. The problem intensifies because a diffusion pipeline usually consists of several components: a text encoder, a diffusion backbone, and an image decoder. Furthermore, modern diffusion pipelines use multiple text encoders; for example, Stable Diffusion 3 uses three. It takes 18.765 GB of GPU memory to run SD3 inference using FP16 precision.
These high memory requirements can make it difficult to use these models with consumer GPUs, slowing adoption and making experimentation harder. In this post, we show how to improve the memory efficiency of Transformer-based diffusion pipelines by leveraging Quanto's quantization utilities with the Diffusers library.
Preliminaries
For a detailed introduction to Quanto, please refer to this post. In short, Quanto is a quantization toolkit built on PyTorch. It is part of Hugging Face Optimum, a set of tools for hardware optimization.
Model quantization is a popular tool among LLM practitioners, but much less so with diffusion models. Quanto can help bridge this gap and provide memory savings with little or no quality degradation.
For benchmarking purposes, we use an H100 GPU. Unless otherwise specified, we default to performing computations in FP16. We chose not to quantize the VAE to prevent numerical instability issues. Our benchmarking code can be found here.
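While the actual benchmarking script is the one linked above, a minimal sketch of how latency and peak memory can be measured is shown below (the prompt and warm-up settings here are illustrative, not the exact benchmark configuration):

```python
import time

import torch
from diffusers import PixArtSigmaPipeline

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Warm up once so one-time initialization does not skew the timing.
_ = pipeline("a warm-up prompt", num_inference_steps=2)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
_ = pipeline("ghibli style, a fantasy landscape with castles").images[0]
torch.cuda.synchronize()
latency = time.perf_counter() - start

peak_memory_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Latency: {latency:.3f} s, peak memory: {peak_memory_gb:.3f} GB")
```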
At the time of this writing, Diffusers ships several Transformer-based diffusion pipelines for text-to-image generation.
We also have Latte, a Transformer-based text-to-video generation pipeline.
For brevity, we keep our study limited to the following three: PixArt-Sigma (a roughly 0.6B-parameter diffusion backbone), Stable Diffusion 3 (roughly 2B), and Aura Flow (roughly 6.8B).
It is worth keeping in mind that this post primarily focuses on memory efficiency, at a slight or negligible cost to inference latency.
Quantizing a DiffusionPipeline with Quanto
Quantizing a model with Quanto is simple.
```python
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import PixArtSigmaPipeline
import torch

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

quantize(pipeline.transformer, weights=qfloat8)
freeze(pipeline.transformer)
```
We call quantize() on the module to be quantized, specifying what we want to quantize. In the above case, we are only quantizing the parameters and leaving the activations as is, and we quantize to the FP8 data type. We finally call freeze() to replace the original parameters with the quantized parameters.
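For completeness, quantize() also accepts an activations argument if you want to quantize activations as well; we did not do this for the benchmarks in this post, and activation quantization typically requires a calibration pass. A sketch, continuing from the pipeline loaded above, might look like this:

```python
from optimum.quanto import Calibration, freeze, qfloat8, quantize

# Quantize both weights and activations (not what we do in this post).
quantize(pipeline.transformer, weights=qfloat8, activations=qfloat8)

# Activation quantization needs representative data to set the scales.
with Calibration():
    _ = pipeline("a calibration prompt", num_inference_steps=2)

freeze(pipeline.transformer)
```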
We can then call this pipeline normally:
```python
image = pipeline("ghibli style, a fantasy landscape with castles").images[0]
```
*(Image comparison: output with the pipeline in FP16 vs. with the diffusion transformer in FP8.)*
We notice the following memory savings when using FP8, with slightly higher latency and almost no quality degradation:
| Batch Size | Quantization | Memory (GB) | Latency (Seconds) |
|---|---|---|---|
| 1 | None | 12.086 | 1.200 |
| 1 | FP8 | 11.547 | 1.540 |
| 4 | None | 12.087 | 4.482 |
| 4 | FP8 | 11.548 | 5.109 |
We can quantize the text encoder in the same way:
```python
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)
```
The text encoder is also a transformer model, and we can quantize it too. Quantizing both the text encoder and the diffusion backbone leads to much larger memory improvements:
| Batch Size | Quantization | Quantize TE | Memory (GB) | Latency (Seconds) |
|---|---|---|---|---|
| 1 | FP8 | False | 11.547 | 1.540 |
| 1 | FP8 | True | 5.363 | 1.601 |
| 4 | FP8 | False | 11.548 | 5.109 |
| 4 | FP8 | True | 5.364 | 5.141 |
Quantizing the text encoder produces results very similar to the previous case.
Generality of the observations
Quantizing the text encoder along with the diffusion backbone generally works for the models we tried. Stable Diffusion 3 is a special case, as it uses three different text encoders. We found that quantizing the second text encoder does not work well, so we recommend the following alternatives:

- Quantize only the first text encoder,
- Quantize only the third text encoder, or
- Quantize the first and third text encoders together.
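As a sketch, the last of these combinations (quantizing text encoders 1 and 3 while leaving the second untouched) could look like this, assuming the usual SD3 pipeline attribute names:

```python
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import StableDiffusion3Pipeline
import torch

pipeline = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

# Quantize the diffusion backbone plus text encoders 1 and 3,
# leaving text_encoder_2 in FP16.
for module in [pipeline.transformer, pipeline.text_encoder, pipeline.text_encoder_3]:
    quantize(module, weights=qfloat8)
    freeze(module)
```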
The table below gives an idea of the expected memory savings for different text encoder quantization combinations (the diffusion transformer is quantized in all cases; the recommended combinations are marked with ✅):
| Batch Size | Quantization | Quantize TE 1 | Quantize TE 2 | Quantize TE 3 | Memory (GB) | Latency (Seconds) |
|---|---|---|---|---|---|---|
| 1 | FP8 | 1 | 1 | 1 | 8.200 | 2.858 |
| 1 ✅ | FP8 | 0 | 0 | 1 | 8.294 | 2.781 |
| 1 | FP8 | 1 | 1 | 0 | 14.384 | 2.833 |
| 1 | FP8 | 0 | 1 | 0 | 14.475 | 2.818 |
| 1 ✅ | FP8 | 1 | 0 | 0 | 14.384 | 2.730 |
| 1 | FP8 | 0 | 1 | 1 | 8.325 | 2.875 |
| 1 ✅ | FP8 | 1 | 0 | 1 | 8.204 | 2.789 |
| 1 | None | – | – | – | 16.403 | 2.118 |
*(Image comparison: quantized text encoder 1 vs. quantized text encoder 3 vs. quantized text encoders 1 and 3.)*
Misc findings
bfloat16 is likely to be better on H100
Using bfloat16 can be faster on supported GPU architectures, such as an H100 or RTX 4090. The table below presents some numbers for PixArt measured on our H100 reference hardware:
| Batch Size | Precision | Quantization | Memory (GB) | Latency (Seconds) | Quantize TE |
|---|---|---|---|---|---|
| 1 | FP16 | INT8 | 5.363 | 1.538 | True |
| 1 | BF16 | INT8 | 5.364 | 1.454 | True |
| 1 | FP16 | FP8 | 5.363 | 1.601 | True |
| 1 | BF16 | FP8 | 5.363 | 1.495 | True |
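Switching to bfloat16 only requires changing the dtype the pipeline is loaded with; here is a minimal sketch for the FP8 row above:

```python
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import PixArtSigmaPipeline
import torch

# Load the pipeline in BF16 instead of FP16.
pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.bfloat16
).to("cuda")

quantize(pipeline.transformer, weights=qfloat8)
freeze(pipeline.transformer)
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)
```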
The promise of qint8
We found that quantizing with qint8 (instead of qfloat8) is generally better in terms of inference latency. This effect gets more pronounced when we horizontally fuse the attention QKV projections (by calling fuse_qkv_projections() in Diffusers), thereby thickening the dimensions of the int8 kernels and speeding up computation. We present some evidence below for PixArt:
| Batch Size | Quantization | Memory (GB) | Latency (Seconds) | Quantize TE | QKV Projection |
|---|---|---|---|---|---|
| 1 | INT8 | 5.363 | 1.538 | True | False |
| 1 | INT8 | 5.536 | 1.504 | True | True |
| 4 | INT8 | 5.365 | 5.129 | True | False |
| 4 | INT8 | 5.538 | 4.989 | True | True |
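For reference, the qint8 + fused-QKV configuration above can be reproduced with something like the following sketch, assuming the PixArt transformer exposes fuse_qkv_projections() (as attention-based models in Diffusers generally do):

```python
from optimum.quanto import freeze, qint8, quantize
from diffusers import PixArtSigmaPipeline
import torch

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Fuse the attention QKV projections before quantizing so that the
# resulting int8 matmuls operate on larger, more efficient shapes.
pipeline.transformer.fuse_qkv_projections()

quantize(pipeline.transformer, weights=qint8)
freeze(pipeline.transformer)
quantize(pipeline.text_encoder, weights=qint8)
freeze(pipeline.text_encoder)
```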
How about INT4?
We additionally experimented with qint4 when using bfloat16. This is only applicable to bfloat16 on H100 because other configurations are not supported yet. With qint4, we can expect further improvements in memory consumption at the cost of increased inference latency. The increase in latency is expected, because there is no native hardware support for int4 computation; the weights are transferred using 4 bits, but the computation is still done in bfloat16. The table below shows our results for PixArt-Sigma:
| Batch Size | Quantize TE | Memory (GB) | Latency (Seconds) |
|---|---|---|---|
| 1 | No | 9.380 | 7.431 |
| 1 | Yes | 3.058 | 7.604 |
Note, however, that due to the aggressive discretization of INT4, the end results can take a hit. This is why, for Transformer-based models in general, we usually leave the final projection layer out of quantization. In Quanto, we do this as follows:
```python
from optimum.quanto import freeze, qint4, quantize

quantize(pipeline.transformer, weights=qint4, exclude="proj_out")
freeze(pipeline.transformer)
```
"proj_out" corresponds to the ultimate layer in pipeline.transformer. The table below presents results for various settings:
*(Image comparison across four settings: Quantize TE: No / Layer exclusion: None; Quantize TE: No / Layer exclusion: "proj_out"; Quantize TE: Yes / Layer exclusion: None; Quantize TE: Yes / Layer exclusion: "proj_out".)*
To recover the lost image quality, a common practice is to perform quantization-aware training, which is also supported in Quanto. This technique is out of the scope of this post; feel free to contact us if you're interested!
All the results of our experiments for this post can be found here.
Bonus – saving and loading Diffusers models in Quanto
Quantized Diffusers models can be saved and loaded:
```python
from diffusers import PixArtTransformer2DModel
from optimum.quanto import QuantizedPixArtTransformer2DModel, qfloat8

model = PixArtTransformer2DModel.from_pretrained("PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="transformer")
qmodel = QuantizedPixArtTransformer2DModel.quantize(model, weights=qfloat8)
qmodel.save_pretrained("pixart-sigma-fp8")
```
The resulting checkpoint is 587 MB in size, instead of the original 2.44 GB. We can then load it:
```python
from optimum.quanto import QuantizedPixArtTransformer2DModel
import torch

transformer = QuantizedPixArtTransformer2DModel.from_pretrained("pixart-sigma-fp8")
transformer.to(device="cuda", dtype=torch.float16)
```
And use it in a DiffusionPipeline:
```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    transformer=None,
    torch_dtype=torch.float16,
).to("cuda")
pipe.transformer = transformer

prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]
```
In the future, we can expect to pass the transformer directly when initializing the pipeline, so that this will work:
```diff
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
-    transformer=None,
+    transformer=transformer,
    torch_dtype=torch.float16,
).to("cuda")
```
The QuantizedPixArtTransformer2DModel implementation is available here for reference. If you want more models from Diffusers to be supported in Quanto for saving and loading, please open an issue here and mention @sayakpaul.
Suggestions
- Based on your requirements, you may want to apply different types of quantization to different pipeline modules. For example, you could use FP8 for the text encoder but INT8 for the diffusion transformer. Thanks to the flexibility of Diffusers and Quanto, this can be done seamlessly.
- To optimize for your use cases, you can even combine quantization with other memory optimization techniques in Diffusers, such as enable_model_cpu_offload(). A combined sketch of both ideas follows below.
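As an illustration of these tips, here is a minimal sketch that mixes FP8 for the text encoder with INT8 for the transformer and enables model CPU offloading:

```python
from optimum.quanto import freeze, qfloat8, qint8, quantize
from diffusers import PixArtSigmaPipeline
import torch

# Do not move the pipeline to "cuda" manually; offloading handles device placement.
pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
)

# Mix quantization types across modules: FP8 text encoder, INT8 transformer.
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)
quantize(pipeline.transformer, weights=qint8)
freeze(pipeline.transformer)

# Combine with other Diffusers memory optimizations.
pipeline.enable_model_cpu_offload()

image = pipeline("ghibli style, a fantasy landscape with castles").images[0]
```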
Conclusion
In this post, we showed how to quantize Transformer models from Diffusers and optimize their memory consumption. The effects of quantization become more visible when we additionally quantize the text encoders involved in the mix. We hope you will apply some of these workflows to your projects and benefit from them 🤗.
Thanks to Pedro Cuenca for his extensive reviews on the post.










