Diffusers welcomes Stable Diffusion 3




Stable Diffusion 3 (SD3), Stability AI’s latest iteration of the Stable Diffusion family of models, is now available on the Hugging Face Hub and can be used with 🧨 Diffusers.

The model released today is Stable Diffusion 3 Medium, with 2B parameters.

As part of this release, we have provided:

  1. Models on the Hub
  2. Diffusers Integration
  3. SD3 Dreambooth and LoRA training scripts






What’s New With SD3?



Model

SD3 is a latent diffusion model that consists of three different text encoders (CLIP L/14, OpenCLIP bigG/14, and T5-v1.1-XXL), a novel Multimodal Diffusion Transformer (MMDiT) model, and a 16-channel AutoEncoder model that is analogous to the one used in Stable Diffusion XL.

SD3 processes text inputs and pixel latents as a sequence of embeddings. Positional encodings are added to 2×2 patches of the latents, which are then flattened into a patch encoding sequence. This sequence, along with the text encoding sequence, is fed into the MMDiT blocks, where they are embedded to a common dimensionality, concatenated, and passed through a sequence of modulated attentions and MLPs.
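The patchification step above can be sketched in a few lines. This is a minimal illustration using plain nested lists in place of real tensors; the function name and shapes are chosen for this example, not taken from the Diffusers codebase.

```python
# Toy sketch of the 2x2 patchify step: a latent of shape (H, W, C) is cut
# into 2x2 patches, and each patch is flattened into one embedding of
# length 2 * 2 * C before positional encodings are added.

def patchify(latent, patch=2):
    """latent: nested list of shape (H, W, C) -> list of flattened patches."""
    h, w = len(latent), len(latent[0])
    sequence = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            flat = []
            for di in range(patch):
                for dj in range(patch):
                    flat.extend(latent[i + di][j + dj])  # all C channels
            sequence.append(flat)
    return sequence

# A 4x4 latent with 16 channels yields a sequence of 4 patch embeddings,
# each of length 2 * 2 * 16 = 64.
latent = [[[0.0] * 16 for _ in range(4)] for _ in range(4)]
seq = patchify(latent)
print(len(seq), len(seq[0]))  # 4 64
```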

To account for the differences between the two modalities, the MMDiT blocks use two separate sets of weights to embed the text and image sequences to a common dimensionality. These sequences are joined before the attention operation, which allows both representations to work in their own space while taking the other one into account during the attention operation. This two-way flow of information between text and image data differs from previous approaches for text-to-image synthesis, where text information is incorporated into the latent via cross-attention with a fixed text representation.
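A toy version of this joint attention can make the design concrete. Everything below (dimensions, weights, function names) is made up for illustration: each modality gets its own projection to a common dimensionality, the sequences are concatenated, and a single attention operation runs over the joint sequence.

```python
import math

def linear(seq, weight):
    # weight: nested list of shape (d_in, d_out)
    return [[sum(x[i] * weight[i][j] for i in range(len(weight)))
             for j in range(len(weight[0]))] for x in seq]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def joint_attention(text_seq, image_seq, w_text, w_image):
    # Separate weights per modality, then one concatenated sequence.
    joint = linear(text_seq, w_text) + linear(image_seq, w_image)
    # Self-attention over the joint sequence (queries = keys = values here).
    out = []
    for q in joint:
        scores = softmax([sum(a * b for a, b in zip(q, k)) for k in joint])
        out.append([sum(s * v[j] for s, v in zip(scores, joint))
                    for j in range(len(q))])
    return out

# Two text tokens (dim 3) and two image patches (dim 2), embedded to dim 2.
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image = [[1.0, 0.0], [0.0, 1.0]]
w_text = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # (3, 2)
w_image = [[1.0, 0.0], [0.0, 1.0]]              # (2, 2)
out = joint_attention(text, image, w_text, w_image)
print(len(out), len(out[0]))  # 4 2
```

The key point the sketch captures is that both modalities attend over one shared sequence, rather than text entering only as fixed keys/values in a cross-attention.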

SD3 also makes use of the pooled text embeddings from both of its CLIP models as part of its timestep conditioning. These embeddings are first concatenated and added to the timestep embedding before being passed to each of the MMDiT blocks.



Training with Rectified Flow Matching

In addition to the architectural changes, SD3 applies a conditional flow-matching objective to train the model. In this approach, the forward noising process is defined as a rectified flow that connects the data and noise distributions on a straight line.
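The straight-line forward process can be sketched directly; this is a minimal illustration of the rectified-flow interpolation, with hypothetical helper names, where a noisy sample is x_t = (1 - t) * x_0 + t * noise and the regression target is the constant velocity along that line.

```python
# Rectified flow connects data and noise on a straight line; the velocity
# along the line, noise - x_0, is constant in t.

def rectified_flow_point(x0, noise, t):
    return [(1 - t) * a + t * b for a, b in zip(x0, noise)]

def velocity_target(x0, noise):
    return [b - a for a, b in zip(x0, noise)]

x0 = [1.0, -2.0]     # a "data" sample
noise = [0.0, 0.0]   # a "noise" sample
print(rectified_flow_point(x0, noise, 0.5))  # [0.5, -1.0]: the midpoint
print(velocity_target(x0, noise))            # [-1.0, 2.0]
```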

The rectified flow-matching sampling process is simpler and performs well when reducing the number of sampling steps. To support inference with SD3, we have introduced a new scheduler (FlowMatchEulerDiscreteScheduler) with a rectified flow-matching formulation and Euler method steps. It also implements resolution-dependent shifting of the timestep schedule via a shift parameter. Increasing the shift value handles noise scaling better for higher resolutions. It is recommended to use shift=3.0 for the 2B model.
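The effect of the shift parameter can be sketched as below. Treat the formula as illustrative: it mirrors the shifting used by the scheduler (with noise levels sigma in (0, 1]), where a shift > 1 pushes the schedule toward higher noise levels while leaving the endpoints fixed.

```python
# Resolution-dependent timestep shifting: shift > 1 raises the noise level
# sigma at intermediate steps, which helps at higher resolutions.

def shift_sigma(sigma, shift=3.0):
    return shift * sigma / (1 + (shift - 1) * sigma)

print(shift_sigma(0.5, shift=1.0))  # 0.5: shift=1 leaves the schedule unchanged
print(shift_sigma(0.5, shift=3.0))  # 0.75: shift=3 raises the noise level
print(shift_sigma(1.0, shift=3.0))  # 1.0: the endpoints are preserved
```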

To quickly try out SD3, refer to the application below:



Using SD3 with Diffusers

To use SD3 with Diffusers, make sure to upgrade to the latest Diffusers release.

pip install --upgrade diffusers

Because the model is gated, before using it with diffusers you first need to go to the Stable Diffusion 3 Medium Hugging Face page, fill in the form, and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate. Use the command below to log in:

huggingface-cli login

The following snippet will download the 2B parameter version of SD3 in fp16 precision. This is the format used in the original checkpoint published by Stability AI, and is the recommended way to run inference.



Text-To-Image

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image

hello_world_cat



Image-To-Image

import torch
from diffusers import StableDiffusion3Img2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusion3Img2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipe(prompt, image=init_image).images[0]
image

wizard_cat

You can check out the SD3 documentation here.



Memory Optimizations for SD3

SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes running the model on GPUs with less than 24GB of VRAM challenging, even when using fp16 precision.

To account for this, the Diffusers integration ships with memory optimizations that allow SD3 to be run on a wider range of devices.



Running Inference with Model Offloading

The most basic memory optimization available in Diffusers allows you to offload the components of the model to the CPU during inference in order to save memory, at the cost of a slight increase in inference latency. Model offloading will only move a model component onto the GPU when it needs to be executed, while keeping the remaining components on the CPU.

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. “This is fine,” the dog assures himself."
image = pipe(prompt).images[0]



Dropping the T5 Text Encoder during Inference

Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3, at the cost of only a slight loss in performance.

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None, 
    tokenizer_3=None, 
    torch_dtype=torch.float16
).to("cuda")

prompt = "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. “This is fine,” the dog assures himself."
image = pipe(prompt).images[0]



Using A Quantized Version of the T5-XXL Model

You can load the T5-XXL model in 8 bits using the bitsandbytes library to reduce the memory requirements further.

import torch
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel, BitsAndBytesConfig


quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=quantization_config,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16
)

You can find the full code snippet here.



Summary of Memory Optimizations

All benchmarking runs were conducted using the 2B version of the SD3 model on an A100 GPU with 80GB of VRAM using fp16 precision and PyTorch 2.3.

For our memory benchmarks, we use 3 iterations of pipeline calls for warming up and report the average inference time over 10 iterations of pipeline calls. We use the default arguments of the StableDiffusion3Pipeline __call__() method.

Technique            | Inference Time (secs) | Memory (GB)
Default              | 4.762                 | 18.765
Offloading           | 32.765 (~6.8x 🔼)     | 12.0645 (~1.55x 🔽)
Offloading + no T5   | 19.110 (~4.013x 🔼)   | 4.266 (~4.398x 🔽)
8bit T5              | 4.932 (~1.036x 🔼)    | 10.586 (~1.77x 🔽)
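The relative factors in the table follow directly from the absolute numbers; recomputing them is a quick sanity check on the benchmark figures (the variable names below are just for this illustration).

```python
# Recompute the offloading row's relative factors from the absolute figures.
default_time, default_mem = 4.762, 18.765
offload_time, offload_mem = 32.765, 12.0645

print(round(offload_time / default_time, 1))  # 6.9, reported as ~6.8x slower
print(round(default_mem / offload_mem, 2))    # 1.56, reported as ~1.55x less memory
```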



Performance Optimizations for SD3

To speed up inference, we can use torch.compile() to obtain an optimized compute graph of the vae and the transformer components.

import torch
from diffusers import StableDiffusion3Pipeline

torch.set_float32_matmul_precision("high")

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
).to("cuda")
pipe.set_progress_bar_config(disable=True)

pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)


prompt = "a photo of a cat holding a sign that says hello world"
for _ in range(3):
    _ = pipe(prompt=prompt, generator=torch.manual_seed(1))


image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0]
image.save("sd3_hello_world.png")

Refer here for the complete script.

We benchmarked the performance of torch.compile() on SD3 on a single 80GB A100 machine using fp16 precision and PyTorch 2.3. We ran 10 iterations of a pipeline inference call with 20 diffusion steps. We found that the average inference time with the compiled versions of the models was 0.585 seconds, a 4x speedup over eager execution.



Dreambooth and LoRA fine-tuning

Additionally, we’re providing a DreamBooth fine-tuning script for SD3 leveraging LoRA. The script can be used to efficiently fine-tune SD3 and serves as a reference for implementing rectified flow-based training pipelines. Other popular implementations of rectified flow include minRF.

To get started with the script, first make sure you have the right setup and a demo dataset available (such as this one). Refer here for details. Install peft and bitsandbytes and then we’re good to go:

export MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth-sd3-lora"

accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path=${MODEL_NAME} \
  --instance_data_dir=${INSTANCE_DIR} \
  --output_dir=/raid/.cache/${OUTPUT_DIR} \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --weighting_scheme="logit_normal" \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub
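The --weighting_scheme="logit_normal" flag above corresponds to the logit-normal timestep sampling proposed in the SD3 paper: draw z from a normal distribution and map it through a sigmoid, so training timesteps concentrate in the middle of [0, 1]. The sketch below is illustrative; the mean/std defaults are assumptions, not values taken from the training script.

```python
import math
import random

def sample_logit_normal(n, mean=0.0, std=1.0, seed=0):
    """Sample n timesteps t = sigmoid(mean + std * z), z ~ N(0, 1)."""
    rng = random.Random(seed)
    return [1 / (1 + math.exp(-(mean + std * rng.gauss(0, 1))))
            for _ in range(n)]

ts = sample_logit_normal(1000)
print(all(0.0 < t < 1.0 for t in ts))  # True: every timestep lies in (0, 1)
```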



Acknowledgements

Thanks to the Stability AI team for making Stable Diffusion 3 happen and providing us with early access. Thanks to Linoy for helping us with the blog post thumbnail.


