Large diffusion models like Flux (a flow-based text-to-image generation model) can create stunning images, but their size can be a hurdle, demanding significant memory and compute resources. Quantization offers a powerful solution, shrinking these models to make them more accessible without drastically compromising performance. But the big question is: can you really tell the difference in the final image?
Before we dive into the technical details of how various quantization backends in Hugging Face Diffusers work, why not test your own perception?
Spot The Quantized Model
We created a setup where you can provide a prompt, and we generate results using both the original, high-precision model (e.g., Flux-dev in BF16) and several quantized versions (BnB 4-bit, BnB 8-bit). The generated images are then presented to you, and your challenge is to identify which ones came from the quantized models.
Try it out here or below!
Often, especially with 8-bit quantization, the differences are subtle and may not be noticeable without close inspection. More aggressive quantization like 4-bit or lower might be more noticeable, but the results can still be good, especially considering the huge memory savings. NF4 often gives the best trade-off, though.
Now, let’s dive deeper.
Quantization Backends in Diffusers
Building on our previous post, "Memory-efficient Diffusion Transformers with Quanto and Diffusers", this post explores the various quantization backends integrated directly into Hugging Face Diffusers. We'll examine how bitsandbytes, GGUF, torchao, Quanto, and native FP8 support make large and powerful models more accessible, demonstrating their use with Flux.
Before diving into the quantization backends, let's introduce the FluxPipeline (using the black-forest-labs/FLUX.1-dev checkpoint) and its components, which we'll be quantizing. Loading the full FLUX.1-dev model in BF16 precision requires roughly 31.447 GB of memory. The main components are:
- Text Encoders (CLIP and T5):
  - Function: Process input text prompts. FLUX-dev uses CLIP for initial understanding and a larger T5 for nuanced comprehension and better text rendering.
  - Memory: T5 – 9.52 GB; CLIP – 246 MB (in BF16)
- Transformer (Main Model – MMDiT):
  - Function: Core generative part (Multimodal Diffusion Transformer). Generates images in latent space from text embeddings.
  - Memory: 23.8 GB (in BF16)
- Variational Auto-Encoder (VAE):
  - Function: Translates images between pixel and latent space. Decodes the generated latent representation to a pixel-based image.
  - Memory: 168 MB (in BF16)
- Focus of Quantization: Examples will primarily focus on the `transformer` and `text_encoder_2` (T5) for the most substantial memory savings. A short sketch for checking these component sizes yourself follows right after this list.
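As a quick sanity check on these numbers, here is a minimal sketch (pure PyTorch, assuming you have enough CPU RAM to hold the BF16 checkpoint) that loads the pipeline and sums parameter and buffer sizes per component:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

def component_size_gb(module: torch.nn.Module) -> float:
    # Sum parameter and buffer sizes in bytes and convert to GB.
    num_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
    num_bytes += sum(b.numel() * b.element_size() for b in module.buffers())
    return num_bytes / 1024**3

for name in ["text_encoder", "text_encoder_2", "transformer", "vae"]:
    print(f"{name}: {component_size_gb(getattr(pipe, name)):.3f} GB")
```

The prompts below are the ones we use for the comparisons throughout this post: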
prompts = [
"Baroque style, a lavish palace interior with ornate gilded ceilings, intricate tapestries, and dramatic lighting over a grand staircase.",
"Futurist style, a dynamic spaceport with sleek silver starships docked at angular platforms, surrounded by distant planets and glowing energy lines.",
"Noir style, a shadowy alleyway with flickering street lamps and a solitary trench-coated figure, framed by rain-soaked cobblestones and darkened storefronts.",
]
bitsandbytes (BnB)
bitsandbytes is a popular and user-friendly library for 8-bit and 4-bit quantization, widely used for LLMs and QLoRA fine-tuning. We can use it for transformer-based diffusion and flow models, too.
Visual comparison of Flux-dev model outputs using BF16 (left), BnB 4-bit (center), and BnB 8-bit (right) quantization.
| Precision | Memory after loading | Peak memory | Inference time |
|---|---|---|---|
| BF16 | ~31.447 GB | 36.166 GB | 12 seconds |
| 4-bit | 12.584 GB | 17.281 GB | 12 seconds |
| 8-bit | 19.273 GB | 24.432 GB | 27 seconds |
All benchmarks performed on 1x NVIDIA H100 80GB GPU
Example (Flux-dev with BnB 4-bit):
import torch
from diffusers import FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
model_id = "black-forest-labs/FLUX.1-dev"
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
"transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
"text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
}
)
pipe = FluxPipeline.from_pretrained(
model_id,
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16
)
pipe.to("cuda")
prompt = "Baroque style, a lavish palace interior with ornate gilded ceilings, intricate tapestries, and dramatic lighting over a grand staircase."
pipe_kwargs = {
"prompt": prompt,
"height": 1024,
"width": 1024,
"guidance_scale": 3.5,
"num_inference_steps": 50,
"max_sequence_length": 512,
}
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")
image = pipe(
**pipe_kwargs, generator=torch.manual_seed(0),
).images[0]
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")
image.save("flux-dev_bnb_4bit.png")
Note: When using `PipelineQuantizationConfig` with `bitsandbytes`, you need to import `DiffusersBitsAndBytesConfig` from `diffusers` and `TransformersBitsAndBytesConfig` from `transformers` separately. This is because these components originate from different libraries. If you prefer a simpler setup without managing these distinct imports, you can use an alternative approach for pipeline-level quantization; an example of this approach is shown below and in the Diffusers documentation on pipeline-level quantization.
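As a sketch of that simpler approach (assuming the `quant_backend`, `quant_kwargs`, and `components_to_quantize` arguments described in the Diffusers pipeline-level quantization docs), the same 4-bit setup could be written as:

```python
import torch
from diffusers import FluxPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# A single backend string plus shared kwargs; Diffusers resolves the right
# config class for each listed component.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")
```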
For more information, check out the bitsandbytes docs.
torchao
torchao is a PyTorch-native library for architecture optimization, offering quantization, sparsity, and custom data types, designed for compatibility with torch.compile and FSDP. Diffusers supports a wide range of torchao's exotic data types, enabling fine-grained control over model optimization.
Visual comparison of Flux-dev model outputs using torchao int4_weight_only (left), int8_weight_only (center), and float8_weight_only (right) quantization.
| torchao Precision | Memory after loading | Peak memory | Inference time |
|---|---|---|---|
| int4_weight_only | 10.635 GB | 14.654 GB | 109 seconds |
| int8_weight_only | 17.020 GB | 21.482 GB | 15 seconds |
| float8_weight_only | 17.016 GB | 21.488 GB | 15 seconds |
Example (Flux-dev with torchao INT8 weight-only):
@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import TorchAoConfig as DiffusersTorchAoConfig
- from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+ from transformers import TorchAoConfig as TransformersTorchAoConfig
@@
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
- "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
- "text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
+ "transformer": DiffusersTorchAoConfig("int8_weight_only"),
+ "text_encoder_2": TransformersTorchAoConfig("int8_weight_only"),
}
)
Example (Flux-dev with torchao INT4 weight-only):
@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import TorchAoConfig as DiffusersTorchAoConfig
- from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+ from transformers import TorchAoConfig as TransformersTorchAoConfig
@@
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
- "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
- "text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
+ "transformer": DiffusersTorchAoConfig("int4_weight_only"),
+ "text_encoder_2": TransformersTorchAoConfig("int4_weight_only"),
}
)
pipe = FluxPipeline.from_pretrained(
model_id,
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
+ device_map="balanced"
)
- pipe.to("cuda")
For more information, check out the torchao docs.
Quanto
Quanto is a quantization library integrated with the Hugging Face ecosystem via the optimum library.
Visual comparison of Flux-dev model outputs using Quanto INT4 (left), INT8 (center), and FP8 (right) quantization.
| quanto Precision | Memory after loading | Peak memory | Inference time |
|---|---|---|---|
| INT4 | 12.254 GB | 16.139 GB | 109 seconds |
| INT8 | 17.330 GB | 21.814 GB | 15 seconds |
| FP8 | 16.395 GB | 20.898 GB | 16 seconds |
Example (Flux-dev with quanto INT8 weight-only):
@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import QuantoConfig as DiffusersQuantoConfig
- from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+ from transformers import QuantoConfig as TransformersQuantoConfig
@@
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
- "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
- "text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
+ "transformer": DiffusersQuantoConfig(weights_dtype="int8"),
+ "text_encoder_2": TransformersQuantoConfig(weights_dtype="int8"),
}
)
Note: At the time of writing, for float8 support with Quanto, you will need `optimum-quanto<0.2.5` and to use quanto directly. We will be working on fixing this.
Example (Flux-dev with quanto FP8 weight-only):
import torch
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel
from optimum.quanto import freeze, qfloat8, quantize
model_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
model_id,
subfolder="text_encoder_2",
torch_dtype=torch.bfloat16,
)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)
transformer = AutoModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
)
quantize(transformer, weights=qfloat8)
freeze(transformer)
pipe = FluxPipeline.from_pretrained(
model_id,
transformer=transformer,
text_encoder_2=text_encoder_2,
torch_dtype=torch.bfloat16
).to("cuda")
For more information, check out the Quanto docs.
GGUF
GGUF is a file format popular in the llama.cpp community for storing quantized models.
Visual comparison of Flux-dev model outputs using GGUF Q2_k (left), Q4_1 (center), and Q8_0 (right) quantization.
| GGUF Precision | Memory after loading | Peak memory | Inference time |
|---|---|---|---|
| Q2_k | 13.264 GB | 17.752 GB | 26 seconds |
| Q4_1 | 16.838 GB | 21.326 GB | 23 seconds |
| Q8_0 | 21.502 GB | 25.973 GB | 15 seconds |
Example (Flux-dev with GGUF Q4_1):
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
model_id = "black-forest-labs/FLUX.1-dev"
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/resolve/main/flux1-dev-Q4_1.gguf"
transformer = FluxTransformer2DModel.from_single_file(
ckpt_path,
quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
For more information, check out the GGUF docs.
FP8 Layerwise Casting (enable_layerwise_casting)
FP8 Layerwise Casting is a memory optimization technique. It works by storing the model's weights in the compact FP8 (8-bit floating point) format, which uses roughly half the memory of standard FP16 or BF16 precision.
Just before a layer performs its computations, its weights are dynamically cast up to a higher compute precision (like FP16/BF16). Immediately afterward, the weights are cast back down to FP8 for efficient storage. This approach works because the core computations retain high precision, and layers particularly sensitive to quantization (like normalization) are typically skipped. This technique can also be combined with group offloading for further memory savings.
Visual output of the Flux-dev model using FP8 Layerwise Casting (e4m3).
| Precision | Memory after loading | Peak memory | Inference time |
|---|---|---|---|
| FP8 (e4m3) | 23.682 GB | 28.451 GB | 13 seconds |
import torch
from diffusers import AutoModel, FluxPipeline
model_id = "black-forest-labs/FLUX.1-dev"
transformer = AutoModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16
)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")
For more information, check out the Layerwise casting docs.
Combining with More Memory Optimizations and torch.compile
Most of these quantization backends can be combined with the memory optimization techniques offered in Diffusers. Let's explore CPU offloading, group offloading, and torch.compile. You can learn more about these techniques in the Diffusers documentation.
Note: At the time of writing, bnb + `torch.compile` also works if bnb is installed from source and you use a PyTorch nightly, or with `fullgraph=False`.
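Continuing from the BnB 4-bit example above, a minimal sketch of that combination (using `fullgraph=False`, per the note) would be:

```python
# Compile only the quantized transformer; fullgraph=False allows the graph
# breaks introduced by bitsandbytes' custom ops (see the note above).
pipe.transformer = torch.compile(pipe.transformer, fullgraph=False)
```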
Example (Flux-dev with BnB 4-bit + enable_model_cpu_offload):
import torch
from diffusers import FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
model_id = "black-forest-labs/FLUX.1-dev"
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
"transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
"text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
}
)
pipe = FluxPipeline.from_pretrained(
model_id,
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16
)
- pipe.to("cuda")
+ pipe.enable_model_cpu_offload()
Model CPU Offloading (enable_model_cpu_offload): This method moves entire model components (like the UNet, text encoders, or VAE) between the CPU and GPU during inference. It offers substantial VRAM savings and is typically faster than more granular offloading, since it involves fewer, larger data transfers.
bnb + enable_model_cpu_offload:
| Precision | Memory after loading | Peak memory | Inference time |
|---|---|---|---|
| 4-bit | 12.383 GB | 12.383 GB | 17 seconds |
| 8-bit | 19.182 GB | 23.428 GB | 27 seconds |
Example (Flux-dev with fp8 layerwise casting + group offloading):
import torch
from diffusers import FluxPipeline, AutoModel
model_id = "black-forest-labs/FLUX.1-dev"
transformer = AutoModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
# device_map="cuda"
)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
+ transformer.enable_group_offload(onload_device=torch.device("cuda"), offload_device=torch.device("cpu"), offload_type="leaf_level", use_stream=True)
pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
- pipe.to("cuda")
Group offloading (enable_group_offload for diffusers components or apply_group_offloading for generic torch.nn.Modules): It moves groups of internal model layers (like torch.nn.ModuleList or torch.nn.Sequential instances) to the CPU. This approach is usually more memory-efficient than full model offloading and faster than sequential offloading.
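For components that are not Diffusers models (for example, the T5 text encoder, which comes from transformers), a sketch using the generic helper might look like this (assuming `apply_group_offloading` from `diffusers.hooks` and the same onload/offload devices as above):

```python
import torch
from diffusers.hooks import apply_group_offloading

# Generic helper for torch.nn.Module instances that don't expose
# enable_group_offload(), such as the transformers T5 text encoder.
apply_group_offloading(
    pipe.text_encoder_2,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)
```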
FP8 layerwise casting + group offloading:
| Precision | Memory after loading | Peak memory | Inference time |
|---|---|---|---|
| FP8 (e4m3) | 9.264 GB | 14.232 GB | 58 seconds |
Example (Flux-dev with torchao 4-bit + torch.compile):
import torch
from diffusers import FluxPipeline
from diffusers import TorchAoConfig as DiffusersTorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import TorchAoConfig as TransformersTorchAoConfig
from torchao.quantization import Float8WeightOnlyConfig
model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
"transformer":DiffusersTorchAoConfig("int4_weight_only"),
"text_encoder_2": TransformersTorchAoConfig("int4_weight_only"),
}
)
pipe = FluxPipeline.from_pretrained(
model_id,
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
device_map="balanced"
)
+ pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
Note: `torch.compile` can introduce subtle numerical differences, resulting in changes to the image output.
torch.compile: Another complementary approach is to speed up the execution of your model with PyTorch 2.x's torch.compile() feature. Compiling the model doesn't directly lower memory, but it can significantly speed up inference. PyTorch 2.0's compile (Torch Dynamo) works by tracing and optimizing the model graph ahead-of-time.
torchao + torch.compile:
| torchao Precision | Memory after loading | Peak memory | Inference time | Compile Time |
|---|---|---|---|---|
| int4_weight_only | 10.635 GB | 15.238 GB | 6 seconds | ~285 seconds |
| int8_weight_only | 17.020 GB | 22.473 GB | 8 seconds | ~851 seconds |
| float8_weight_only | 17.016 GB | 22.115 GB | 8 seconds | ~545 seconds |
Explore some benchmarking results here:
Ready-to-use quantized checkpoints
You’ll find bitsandbytes and torchao quantized models from this blog post in our Hugging Face collection: link to collection.
Conclusion
Here's a quick guide to choosing a quantization backend:
- Easiest Memory Savings (NVIDIA): Start with `bitsandbytes` 4/8-bit. This can be combined with `torch.compile()` for faster inference.
- Prioritize Inference Speed: `torchao`, `GGUF`, and `bitsandbytes` can all be used with `torch.compile()` to potentially boost inference speed.
- For Hardware Flexibility (CPU/MPS), FP8 Precision: `Quanto` can be a good option.
- Simplicity (Hopper/Ada): Explore FP8 Layerwise Casting (`enable_layerwise_casting`).
- For Using Existing GGUF Models: Use GGUF loading (`from_single_file`).
- Curious about training with quantization? Stay tuned for a follow-up blog post on that topic! Update (June 19, 2025): it's here!
Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
Acknowledgements: Thanks to Chunte for providing the thumbnail for this post.
