LoRA adapters provide a great deal of customization for models of all sizes and kinds. When it comes to image generation, they can equip models with different styles, different characters, and much more. Sometimes, they can also be leveraged to reduce inference latency. Hence, they are particularly important when it comes to customizing and fine-tuning models.
In this post, we take the Flux.1-Dev model for text-to-image generation, chosen for its widespread popularity and adoption, and show how to optimize its inference speed when using LoRAs (~2.3x). It has over 30k adapters trained with it (as reported on the Hugging Face Hub platform), so its importance to the community is significant.
Note that although we demonstrate speedups with Flux, we believe the recipe is generic enough to apply to other models as well.
If you cannot wait to get started with the code, please check out the accompanying code repository.
When serving LoRAs, it is common to hotswap them (swap different LoRAs in and out). Loading a LoRA changes the base model architecture. Moreover, LoRAs can differ from one another – each one may have a different rank and target different layers for adaptation. To account for these dynamic properties of LoRAs, we must take the necessary steps to ensure the optimizations we apply are robust.
For example, we can apply torch.compile to a model loaded with a particular LoRA to reduce inference latency. However, the moment we swap that LoRA out for a different one (with a potentially different configuration), we run into recompilation issues, causing slowdowns in inference.
One can also fuse the LoRA parameters into the base model parameters, run compilation, and unfuse the LoRA parameters when loading new ones. However, due to potential architecture-level changes, this approach will again encounter recompilation issues when inference is run.
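As a concrete illustration, below is a minimal sketch of what that fuse-then-compile flow could look like (the adapter ID is a placeholder, and `pipe` is assumed to be a Flux DiffusionPipeline loaded as shown later in this post):

```python
# Fuse a LoRA into the base weights, then compile. This works well for a single adapter,
# but switching to a structurally different LoRA later changes the module graph and
# forces torch.compile to recompile.
pipe.load_lora_weights("<some-lora-id>")  # placeholder repo ID
pipe.fuse_lora()                          # merge the LoRA weights into the base parameters
pipe.unload_lora_weights()                # drop the separate LoRA layers, keeping the fused weights
pipe.transformer.compile(fullgraph=True, mode="max-autotune")
```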
Our optimization recipe takes the above situations into account to be as realistic as possible. Below are the key components of our optimization recipe:
- Flash Attention 3 (FA3)
- torch.compile
- FP8 quantization from TorchAO
- Hotswapping-ready
Note that among these, FP8 quantization is lossy but often provides the most compelling speed-memory trade-off. Although we tested the recipe primarily on NVIDIA GPUs, it should work on AMD GPUs, too.
In our previous blog posts (post 1 and post 2), we have already discussed the benefits of the first three components of our optimization recipe. Applying them one by one takes just a few lines of code:
from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from utils.fa3_processor import FlashFluxAttnProcessor3_0
import torch
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    quantization_config=PipelineQuantizationConfig(
        quant_mapping={"transformer": TorchAoConfig("float8dq_e4m3_row")}
    ),
).to("cuda")
pipe.transformer.set_attn_processor(FlashFluxAttnProcessor3_0())
pipe.transformer.compile(fullgraph=True, mode="max-autotune")
pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 28,
    "max_sequence_length": 512,
}
image = pipe(**pipe_kwargs).images[0]
The FA3 processor comes from here.
The problems start surfacing when we try to swap LoRAs in and out of a compiled diffusion transformer (pipe.transformer) without triggering recompilation.
Normally, loading and unloading LoRAs would require recompilation, which defeats any speed advantage gained from compilation. Thankfully, there is a way to avoid the need for recompilation. By passing hotswap=True, diffusers leaves the model architecture unchanged and only exchanges the weights of the LoRA adapter itself, which does not necessitate recompilation.
pipe.enable_lora_hotswap(target_rank=max_rank)
pipe.load_lora_weights("<first-lora-id>")  # placeholder: repo ID of the first LoRA
pipe.transformer.compile(mode="max-autotune", fullgraph=True)
image = pipe(**pipe_kwargs).images[0]
pipe.load_lora_weights("<second-lora-id>", hotswap=True)  # placeholder: repo ID of the second LoRA
image = pipe(**pipe_kwargs).images[0]
(As a reminder, the first call to pipe will be slow because torch.compile is a just-in-time compiler. However, subsequent calls should be significantly faster.)
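To see this warm-up effect for yourself, a rough timing loop along these lines (a sketch, not the benchmark script behind the numbers below) makes the difference obvious:

```python
import time

# The first iteration pays the compilation cost; later iterations reflect steady-state latency.
for i in range(3):
    start = time.perf_counter()
    _ = pipe(**pipe_kwargs).images[0]
    print(f"call {i}: {time.perf_counter() - start:.2f}s")
```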
This generally allows for swapping LoRAs without recompilation, but there are limitations:
- We need to provide the maximum rank among all LoRA adapters ahead of time. Thus, if we have one adapter with rank 16 and another with rank 32, we need to pass max_rank=32 (see the sketch after this list).
- LoRA adapters that are hotswapped in can only target the same layers, or a subset of the layers, that the first LoRA targets.
- Targeting the text encoder is not supported yet.
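Here is a sketch of how the maximum rank could be determined for a pool of adapters before calling enable_lora_hotswap. The repo IDs, the weight filename, and the rank heuristic are assumptions made for illustration: many (but not all) LoRAs on the Hub ship their weights as `pytorch_lora_weights.safetensors`, and for 2D LoRA matrices the rank is typically the smaller dimension.

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

lora_pool = ["<lora-repo-id-1>", "<lora-repo-id-2>"]  # placeholder repo IDs

def lora_rank(repo_id, filename="pytorch_lora_weights.safetensors"):
    state_dict = load_file(hf_hub_download(repo_id, filename))
    # For LoRA A/B matrices, the rank is the smaller of the two dimensions
    # (this assumes rank < feature dimension, which holds for typical adapters).
    return max(min(w.shape) for w in state_dict.values() if w.ndim == 2)

# This value is what gets passed to pipe.enable_lora_hotswap(target_rank=max_rank) above.
max_rank = max(lora_rank(repo_id) for repo_id in lora_pool)
```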
For more information on hotswapping in Diffusers and its limitations, visit the hotswapping section of the documentation.
The benefits of this workflow become evident when we compare the inference latency across the options below.
| Option | Time (s) ⬇️ | Speedup (vs baseline) ⬆️ | Notes |
|---|---|---|---|
| baseline | 7.8910 | – | Baseline |
| optimized | 3.5464 | 2.23× | Hotswapping + compilation without recompilation hiccups (FP8 on by default) |
| no_fp8 | 4.3520 | 1.81× | Same as optimized, but with FP8 quantization disabled |
| no_fa3 | 4.3020 | 1.84× | Disable FA3 (flash‑attention v3) |
| baseline + compile | 5.0920 | 1.55× | Compilation on, but suffers from intermittent recompilation stalls |
| no_fa3_fp8 | 5.0850 | 1.55× | Disable FA3 and FP8 |
| no_compile_fp8 | 7.5190 | 1.05× | Disable FP8 quantization and compilation |
| no_compile | 10.4340 | 0.76× | Disable compilation: the slowest setting |
Key takeaways:
- The "baseline + compile" option provides a decent speedup over the baseline, but it suffers from recompilation issues, which increase the overall execution time. We do not include the compilation time in our benchmarks.
- When recompilation problems are eliminated through hotswapping (the "optimized" option), we achieve the highest speedup.
- In the "optimized" option, FP8 quantization is enabled, which can lead to quality loss. Even without FP8, we get a decent speedup (the "no_fp8" option).
- For demonstration purposes, we use a pool of two LoRAs for hotswapping with compilation (as sketched below). For the complete code, please refer to the accompanying code repository.
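A hedged sketch of what cycling through such a pool could look like, assuming the first LoRA has already been loaded and the transformer compiled as shown earlier (the repo IDs are placeholders):

```python
lora_pool = ["<lora-repo-id-1>", "<lora-repo-id-2>"]  # placeholder repo IDs

for lora_id in lora_pool:
    # Swap the adapter weights in place; the compiled graph is reused without recompilation.
    pipe.load_lora_weights(lora_id, hotswap=True)
    image = pipe(**pipe_kwargs).images[0]
```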
The optimization recipe we have discussed so far assumes access to a powerful GPU like the H100. However, what can we do when we are limited to consumer GPUs such as the RTX 4090? Let's find out.
Flux.1-Dev (without any LoRA), using the bfloat16 data type, takes ~33GB of memory to run. Depending on the size of the LoRA module, and without any optimization, this memory footprint can increase even further. Many consumer GPUs like the RTX 4090 only have 24GB. Throughout the rest of this section, we will consider an RTX 4090 machine as our testbed.
First, to enable end-to-end execution of Flux.1-Dev, we can apply CPU offloading, in which components that are not needed for the current computation are offloaded to the CPU to free up accelerator memory. Doing so allows us to run the entire pipeline in ~22GB, taking 35.403 seconds on an RTX 4090. Enabling compilation brings the latency down to 31.205 seconds (a 1.12x speedup). In terms of code, it's just a few lines:
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16,
)
# Offload components that are not currently in use to the CPU to free accelerator memory.
pipe.enable_model_cpu_offload()
# Regional compilation: compile only the repeated transformer blocks, which keeps
# compilation time low while retaining most of the speedup.
pipe.transformer.compile_repeated_blocks(fullgraph=True)
image = pipe(**pipe_kwargs).images[0]
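If you want to verify the memory figure on your own machine, peak usage can be read back from PyTorch after a generation (a small sketch; the exact number will vary with resolution and other settings):

```python
# Report peak GPU memory after a run to sanity-check the ~22GB figure.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gb:.2f} GB")
```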
Notice that we did not apply FP8 quantization here because it is not supported in combination with CPU offloading and compilation (supporting issue thread). Just applying FP8 quantization to the Flux Transformer is not enough to mitigate the memory exhaustion problem either, so in this case we decided to drop it.
Therefore, to take advantage of the FP8 quantization scheme, we need to find a way to do it without CPU offloading. For Flux.1-Dev, if we additionally apply quantization to the T5 text encoder, we should be able to load and run the complete pipeline in 24GB. Below is a comparison of the results with and without the T5 text encoder quantized (NF4 quantization from bitsandbytes).
As we can see in the figure above, quantizing the T5 text encoder does not incur much of a quality loss. Combining the quantized T5 text encoder and the FP8-quantized Flux Transformer with torch.compile gives us quite reasonable results – from 32.27 seconds down to 9.668 seconds (a sizeable ~3.3x speedup) with no noticeable quality drop.
It is possible to generate images within 24GB of VRAM even without quantizing the T5 text encoder, but that would have made our generation pipeline slightly more complicated.
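Here is a minimal sketch of how the two quantization schemes could be combined in a single pipeline. It assumes the T5 encoder is the `text_encoder_2` component of the Flux pipeline and uses the `BitsAndBytesConfig` from transformers for NF4; the exact configuration behind our numbers lives in the accompanying code repository.

```python
import torch
from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        # FP8 (TorchAO) for the diffusion transformer...
        "transformer": TorchAoConfig("float8dq_e4m3_row"),
        # ...and NF4 (bitsandbytes) for the T5 text encoder.
        "text_encoder_2": BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
    }
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    quantization_config=pipeline_quant_config,
).to("cuda")
pipe.transformer.compile(fullgraph=True, mode="max-autotune")
```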
We now have a way to run the entire Flux.1-Dev pipeline with FP8 quantization on an RTX 4090, and we can apply the previously established recipe for optimizing LoRA inference on the same hardware. Since FA3 is not supported on the RTX 4090, we will stick to the following optimization recipe, with T5 quantization newly added to the mix:
- FP8 quantization
- torch.compile
- Hotswapping-ready
- T5 quantization (with NF4)
In the table below, we show the inference latency numbers for different combinations of the above components.
| Option | Key args flags | Time (s) ⬇️ | Speedup (vs baseline) ⬆️ |
|---|---|---|---|
| baseline | `disable_fp8=False disable_compile=True quantize_t5=True offload=False` | 23.6060 | – |
| optimized | `disable_fp8=False disable_compile=False quantize_t5=True offload=False` | 11.5715 | 2.04× |
Quick notes:
- Compilation provides a substantial 2x speedup over the baseline.
- The other options yielded OOM errors, even with offloading enabled.
Technical details of hotswapping
To enable hotswapping without triggering recompilation, two hurdles have to be overcome. First, the LoRA scaling factor has to be converted from a Python float into a torch tensor, which is achieved fairly easily. Second, the LoRA weights need to be padded to the largest required shape. That way, the data in the weights can be replaced without the need to reassign the whole attribute. This is why the max_rank argument discussed above is crucial. As we pad the values with zeros, the results remain unchanged, although the computation slows down a bit depending on how large the padding is.
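A toy example (not the actual PEFT implementation) illustrates why zero-padding leaves the results unchanged: the extra rows of the A matrix and the extra columns of the B matrix contribute nothing to the output, so a rank-16 adapter can live inside rank-32 buffers that are later overwritten in place by another adapter.

```python
import torch

in_features, out_features, rank, max_rank = 64, 64, 16, 32
A = torch.randn(rank, in_features)   # LoRA "down" projection
B = torch.randn(out_features, rank)  # LoRA "up" projection

# Pad both matrices to the maximum rank with zeros.
A_pad = torch.zeros(max_rank, in_features)
B_pad = torch.zeros(out_features, max_rank)
A_pad[:rank] = A
B_pad[:, :rank] = B

x = torch.randn(8, in_features)
# The padded adapter computes exactly the same delta as the original one.
assert torch.allclose(x @ A.T @ B.T, x @ A_pad.T @ B_pad.T)
```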
Since no new LoRA attributes are added, this also means that each LoRA after the first one can only target the same layers, or a subset of the layers, that the first one targets. Thus, choose the order of loading wisely. If LoRAs target disjoint layers, there is the possibility of creating a dummy LoRA that targets the union of all target layers.
To see the nitty-gritty of this implementation, visit the hotswap.py file in PEFT.
This post outlined an optimization recipe for fast LoRA inference with Flux, demonstrating significant speedups. Our approach combines Flash Attention 3, torch.compile, and FP8 quantization while ensuring hotswapping capabilities without recompilation issues. On high-end GPUs like the H100, this optimized setup provides a 2.23x speedup over the baseline.
For consumer GPUs, specifically the RTX 4090, we tackled memory limitations by introducing T5 text encoder quantization (NF4) and leveraging regional compilation. This comprehensive recipe achieved a substantial 2.04x speedup, making LoRA inference on Flux viable and performant even with limited VRAM. The key insight is that by carefully managing compilation and quantization, the benefits of LoRA can be fully realized across different hardware configurations.
Hopefully, the recipes from this post will encourage you to optimize your LoRA-based use cases and benefit from speedy inference.
Resources
Below is a list of the important resources cited throughout this post: