Exploring simple optimizations for SDXL

By Sayak Paul and Steven Liu

Stable Diffusion XL (SDXL) is the latest latent diffusion model from Stability AI for generating high-quality, photorealistic images. It overcomes challenges of previous Stable Diffusion models, like getting hands and text right as well as spatially correct compositions. In addition, SDXL is more context-aware and requires fewer words in its prompt to generate better-looking images.

However, all of these improvements come at the expense of a significantly larger model. How much larger? The base SDXL model has 3.5B parameters (the UNet, in particular), which is approximately 3x larger than the previous Stable Diffusion model.

To explore how we can optimize SDXL for inference speed and memory use, we ran some tests on an A100 GPU (40 GB). For each inference run, we generate 4 images and repeat it 3 times. While computing the inference latency, we only consider the final iteration out of the 3 iterations.

So if you run SDXL out of the box as-is, with full precision and the default attention mechanism, it will consume 28GB of memory and take 72.2 seconds!

from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda")

# Force the vanilla attention processor to get the fully unoptimized baseline.
pipe.unet.set_default_attn_processor()
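
To mirror the measurement methodology described above, a rough sketch of the timing loop could look like the following. This is not necessarily the exact benchmark script behind the numbers in this post, and the prompt is just a placeholder:

import time

prompt = "a photo of an astronaut riding a horse on mars"  # placeholder prompt

# 3 runs of 4 images each; only the latency of the final run is reported.
for _ in range(3):
    start = time.time()
    images = pipe(prompt, num_images_per_prompt=4).images
    latency = time.time() - start

print(f"Latency of the final run: {latency:.1f} s")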

This isn’t very practical and can slow you down since you’re often generating more than 4 images. And if you don’t have a more powerful GPU, you’ll run into that frustrating out-of-memory error message. So how can we optimize SDXL to increase inference speed and reduce its memory usage?

In 🤗 Diffusers, we have a bunch of optimization tricks and techniques to help you run memory-intensive models like SDXL, and we’ll show you how! The two things we’ll focus on are inference speed and memory.

🧠 The techniques discussed in this post are applicable to all the pipelines.



Inference speed

Diffusion is a random process, so there is no guarantee you will get an image you like. Often, you’ll need to run inference multiple times and iterate, and that’s why optimizing for speed is crucial. This section focuses on using lower precision weights and incorporating memory-efficient attention and torch.compile from PyTorch 2.0 to boost speed and reduce inference time.



Lower precision

Model weights are stored at a certain precision, which is expressed as a floating point data type. The standard floating point data type is float32 (fp32), which can accurately represent a wide range of floating numbers. For inference, you often don’t need to be as precise, so you should use float16 (fp16), which captures a narrower range of floating numbers. This means fp16 takes only half the amount of memory to store compared to fp32, and it is twice as fast because it is easier to calculate. In addition, modern GPU cards have optimized hardware to run fp16 calculations, making it even faster.

With 🤗 Diffusers, you can use fp16 for inference by specifying the torch_dtype parameter to convert the weights when the model is loaded:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.unet.set_default_attn_processor()

Compared to a completely unoptimized SDXL pipeline, using fp16 takes 21.7GB of memory and only 14.8 seconds. You’re almost speeding up inference by a full minute!



Memory-efficient attention

The attention blocks used in transformer modules can be a huge bottleneck, because memory increases quadratically as input sequences get longer. This can quickly take up a ton of memory and leave you with an out-of-memory error message. 😬

Memory-efficient attention algorithms seek to reduce the memory burden of calculating attention, whether by exploiting sparsity or tiling. These optimized algorithms used to be mostly available as third-party libraries that needed to be installed separately. But starting with PyTorch 2.0, this is no longer the case. PyTorch 2 introduced scaled dot product attention (SDPA), which offers fused implementations of Flash Attention, memory-efficient attention (xFormers), and a PyTorch implementation in C++. SDPA is probably the easiest way to speed up inference: if you’re using PyTorch ≥ 2.0 with 🤗 Diffusers, it’s automatically enabled by default!

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

Compared to a completely unoptimized SDXL pipeline, using fp16 and SDPA takes the same amount of memory, and the inference time improves to 11.4 seconds. Let’s use this as the new baseline we’ll compare the other optimizations to.
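
SDPA is picked up automatically on PyTorch 2.0, so no extra code is needed. If you previously forced the vanilla processor (as in the unoptimized baseline above), a minimal sketch for switching back explicitly looks like this:

from diffusers.models.attention_processor import AttnProcessor2_0

# Explicitly select PyTorch's scaled dot product attention
# (already the default when running PyTorch >= 2.0).
pipe.unet.set_attn_processor(AttnProcessor2_0())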



torch.compile

PyTorch 2.0 also introduced the torch.compile API for just-in-time (JIT) compilation of your PyTorch code into more optimized kernels for inference. Unlike other compiler solutions, torch.compile requires minimal changes to your existing code and it’s as easy as wrapping your model with the function.

With the mode parameter, you can optimize for memory overhead or inference speed during compilation, which gives you far more flexibility.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

Compared to the previous baseline (fp16 + SDPA), wrapping the UNet with torch.compile improves inference time to 10.2 seconds.

⚠️ The first time you compile a model it is slower, but once the model is compiled, all subsequent calls to it are much faster!
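
In practice, this means you typically want a warm-up call before measuring latency or serving requests. A minimal sketch (the prompt is a placeholder):

prompt = "a photo of an astronaut riding a horse on mars"  # placeholder prompt

# First call: slow, triggers compilation of the UNet.
_ = pipe(prompt, num_images_per_prompt=4).images

# Subsequent calls: fast, reuse the compiled kernels.
images = pipe(prompt, num_images_per_prompt=4).images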


Model memory footprint

Models today are growing larger and larger, making it a challenge to fit them into memory. This section focuses on how you can reduce the memory footprint of these enormous models so you can run them on consumer GPUs. These techniques include CPU offloading, decoding latents into images over several steps rather than all at once, and using a distilled version of the autoencoder.



Model CPU offloading

Model offloading saves memory by loading the UNet into GPU memory while the other components of the diffusion model (text encoders, VAE) are loaded onto the CPU. This way, the UNet can run for multiple iterations on the GPU until it is no longer needed.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

Compared to the baseline, it now takes 20.2GB of memory, which saves you 1.5GB of memory.



Sequential CPU offloading

Another type of offloading that can save you more memory, at the expense of slower inference, is sequential CPU offloading. Rather than offloading an entire model – like the UNet – model weights stored in the different UNet submodules are offloaded to the CPU and only loaded onto the GPU right before the forward pass. Essentially, you’re only loading parts of the model at a time, which allows you to save even more memory. The only downside is that it’s significantly slower because you’re loading and offloading submodules many times.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.enable_sequential_cpu_offload()

Compared to the baseline, this takes 19.9GB of memory, but the inference time increases to 67 seconds.



Slicing

In SDXL, a variational autoencoder (VAE) decodes the refined latents (predicted by the UNet) into realistic images. The memory requirement of this step scales with the number of images being predicted (the batch size). Depending on the image resolution and the available GPU VRAM, it can be quite memory-intensive.

This is where “slicing” is useful. The input tensor to be decoded is split into slices, and the computation to decode it is completed over several steps. This saves memory and allows larger batch sizes.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_vae_slicing()

With sliced computations, we reduce the memory to 15.4GB. If we add sequential CPU offloading, it’s further reduced to 11.45GB, which lets you generate 4 images (1024×1024) per prompt. However, with sequential offloading, the inference latency also increases.
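
A sketch of the combined setup described above (sequential CPU offloading plus VAE slicing) could look like this:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)

# Offload submodules to the CPU and decode latents slice by slice.
pipe.enable_sequential_cpu_offload()
pipe.enable_vae_slicing()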



Caching computations

Any text-conditioned image generation model typically uses a text encoder to compute embeddings from the input prompt. SDXL uses two text encoders! This contributes quite a bit to the inference latency. However, since these embeddings remain unchanged throughout the reverse diffusion process, we can precompute them and reuse them as we go. This way, after computing the text embeddings, we can remove the text encoders from memory.

First, load the text encoders and their corresponding tokenizers and compute the embeddings from the input prompt:
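
The snippet below assumes the two SDXL text encoders, their tokenizers, an input prompt, a small `flush()` helper, and an `encode_prompt()` helper (which can mirror the pipeline’s own `StableDiffusionXLPipeline.encode_prompt()` method) are already defined. Here is a minimal, hypothetical sketch of that setup:

import gc
import torch
from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

ckpt = "stabilityai/stable-diffusion-xl-base-1.0"

# SDXL ships two tokenizers and two text encoders in its checkpoint.
tokenizer = CLIPTokenizer.from_pretrained(ckpt, subfolder="tokenizer")
tokenizer_2 = CLIPTokenizer.from_pretrained(ckpt, subfolder="tokenizer_2")
text_encoder = CLIPTextModel.from_pretrained(
    ckpt, subfolder="text_encoder", torch_dtype=torch.float16
).to("cuda")
text_encoder_2 = CLIPTextModelWithProjection.from_pretrained(
    ckpt, subfolder="text_encoder_2", torch_dtype=torch.float16
).to("cuda")

def flush():
    # Release cached GPU memory after deleting the text encoders.
    gc.collect()
    torch.cuda.empty_cache()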

tokenizers = [tokenizer, tokenizer_2]
text_encoders = [text_encoder, text_encoder_2]

(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds
) = encode_prompt(tokenizers, text_encoders, prompt)

Next, flush the GPU memory to remove the text encoders:

del text_encoder, text_encoder_2, tokenizer, tokenizer_2
flush()

Now the embeddings are good to go straight to the SDXL pipeline:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    torch_dtype=torch.float16,
).to("cuda")

num_images_per_prompt = 4  # e.g., 4 images per run, as in the benchmarks above
num_inference_steps = 50   # the pipeline default

call_args = dict(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=num_inference_steps,
)
image = pipe(**call_args).images[0]

Combined with SDPA and fp16, we can reduce the memory to 21.9GB. The other memory optimization techniques discussed above can also be used with cached computations.
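
For instance, VAE slicing (or the Tiny Autoencoder below) composes naturally with the precomputed embeddings; a minimal sketch:

# Decode latents in slices on top of the cached text embeddings.
pipe.enable_vae_slicing()
image = pipe(**call_args).images[0]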



Tiny Autoencoder

As previously mentioned, a VAE decodes latents into images. Naturally, this step is directly bottlenecked by the size of the VAE. So, let’s just use a smaller autoencoder! The Tiny Autoencoder by madebyollin, available on the Hub, is just 10MB, and it’s distilled from the original VAE used by SDXL.

import torch
from diffusers import AutoencoderTiny, StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)
pipe.to("cuda")

With this setup, we reduce the memory requirement to 15.6GB while also reducing the inference latency.

⚠️ The Tiny Autoencoder can omit some of the more fine-grained details from images, which is why it is more appropriate for image previews.



Conclusion

To conclude and summarize the savings from our optimizations:

| Technique | Memory (GB) | Inference latency (ms) |
| --- | --- | --- |
| unoptimized pipeline | 28.09 | 72200.5 |
| fp16 | 21.72 | 14800.9 |
| fp16 + SDPA (default) | 21.72 | 11413.0 |
| default + `torch.compile` | 21.73 | 10296.7 |
| default + model CPU offload | 20.21 | 16082.2 |
| default + sequential CPU offload | 19.91 | 67034.0 |
| default + VAE slicing | 15.40 | 11232.2 |
| default + VAE slicing + sequential CPU offload | 11.47 | 66869.2 |
| default + precomputed text embeddings | 21.85 | 11909.0 |
| default + Tiny Autoencoder | 15.48 | 10449.7 |

⚠️ While profiling GPUs to measure the trade-off between inference latency and memory requirements, it is important to be aware of the hardware being used. The above findings may not translate equally from hardware to hardware. For example, `torch.compile` only seems to benefit modern GPUs, at least for SDXL.

We hope these optimizations make it a breeze to run your favorite pipelines. Try these techniques out and share your images with us! 🤗


Acknowledgements: Thanks to Pedro Cuenca for his helpful reviews on the draft.




