Stable Diffusion 3.5 is the improved variant of its predecessor, Stable Diffusion 3.
As of today, the models are available on the Hugging Face Hub and can be used with 🧨 Diffusers.
The release comes with two checkpoints:
- A large (8B) model
- A large (8B) timestep-distilled model enabling few-step inference
In this post, we’ll focus on how to use Stable Diffusion 3.5 (SD3.5) with Diffusers, covering both inference and training.
Architectural changes
The transformer architecture of SD3.5 (large) is very similar to that of SD3 (medium), with the following changes:
- QK normalization: For training large transformer models, QK normalization has now become a standard, and SD3.5 Large is no exception.
- Dual attention layers: Instead of using single attention layers for each stream of modality in the MMDiT blocks, SD3.5 uses double attention layers.
The rest of the details regarding the text encoders, VAE, and noise scheduler stay exactly the same as in SD3 Medium. For more on SD3, we recommend checking out the original paper.
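If you want to confirm these changes from code, one way is to inspect the transformer’s configuration once you have access to the checkpoint (a minimal sketch; the exact config keys are assumptions and may differ across Diffusers versions):
from diffusers import SD3Transformer2DModel

# Fetch only the config of the gated checkpoint (requires accepting the gate, see below).
config = SD3Transformer2DModel.load_config(
    "stabilityai/stable-diffusion-3.5-large", subfolder="transformer"
)
print(config.get("qk_norm"))                # QK normalization variant used in attention
print(config.get("dual_attention_layers"))  # indices of blocks that use dual attention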
Using SD3.5 with Diffusers
Make sure you install the latest version of diffusers:
pip install -U diffusers
As the model is gated, before using it with diffusers you first need to go to the Stable Diffusion 3.5 Large Hugging Face page, fill in the form, and accept the gate.
Once you’re in, you need to log in so that your system knows you’ve accepted the gate. Use the command below to log in:
huggingface-cli login
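If you prefer to log in from Python instead, the huggingface_hub library offers an equivalent helper:
from huggingface_hub import login

login()  # prompts for your access token; you can also pass token="hf_..."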
The following snippet will download the 8B parameter version of SD3.5 in torch.bfloat16 precision.
This is the format used in the original checkpoint published by Stability AI, and is the recommended way to run inference.
import torch
from diffusers import StableDiffusion3Pipeline
pipe = StableDiffusion3Pipeline.from_pretrained(
"stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
prompt="a photograph of a cat holding an indication that claims hello world",
negative_prompt="",
num_inference_steps=40,
height=1024,
width=1024,
guidance_scale=4.5,
).images[0]
image.save("sd3_hello_world.png")
The release also comes with a “timestep-distilled” model that eliminates classifier-free guidance and lets us generate images in fewer steps (typically 4-8 steps).
import torch
from diffusers import StableDiffusion3Pipeline
pipe = StableDiffusion3Pipeline.from_pretrained(
"stabilityai/stable-diffusion-3.5-large-turbo", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
prompt="a photograph of a cat holding an indication that claims hello world",
num_inference_steps=4,
height=1024,
width=1024,
guidance_scale=1.0,
).images[0]
image.save("sd3_hello_world.png")
All of the examples shown in our SD3 blog post and the official Diffusers documentation should already work with SD3.5.
In particular, both of these resources dive deep into optimizing the memory requirements to run inference.
Since SD3.5 Large is significantly larger than SD3 Medium, memory optimization becomes crucial to allow inference on consumer hardware.
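As a quick illustration (a minimal sketch; the linked resources cover many more options), model CPU offloading alone already helps by keeping only the component that is currently running on the GPU:
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
)
# Offload submodules to CPU and move each one to the GPU only while it is needed.
pipe.enable_model_cpu_offload()

image = pipe("a photo of a cat holding a sign that says hello world").images[0]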
Running inference with quantization
Diffusers natively supports working with bitsandbytes quantization, which optimizes memory even more.
First, make sure to install all the necessary libraries:
pip install -Uq git+https://github.com/huggingface/transformers@main
pip install -Uq bitsandbytes
Then load the transformer in “NF4” precision:
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
import torch
model_id = "stabilityai/stable-diffusion-3.5-large"
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
model_id,
subfolder="transformer",
quantization_config=nf4_config,
torch_dtype=torch.bfloat16
)
And now, we’re ready to run inference:
from diffusers import StableDiffusion3Pipeline
pipeline = StableDiffusion3Pipeline.from_pretrained(
model_id,
transformer=model_nf4,
torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
prompt = "A whimsical and artistic image depicting a hybrid creature that could be a mixture of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. Nevertheless, as an alternative of the standard grey skin, the creature's body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square full of a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the plush, pancake-like foliage within the background, a towering pepper mill standing in for a tree. Because the sun rises on this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"
image = pipeline(
prompt=prompt,
negative_prompt="",
num_inference_steps=28,
guidance_scale=4.5,
max_sequence_length=512,
).images[0]
image.save("whimsical.png")
You can control other knobs in the BitsAndBytesConfig. Refer to the documentation for details.
It is also possible to directly load a model quantized with the same nf4_config as above.
This is particularly helpful for machines with low RAM. Refer to this Colab Notebook for an end-to-end example.
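As a minimal sketch, assuming you have already pushed a transformer quantized with the nf4_config above to a repository (the repository id below is a placeholder), loading it back requires no extra arguments because the quantization config is stored with the checkpoint:
import torch
from diffusers import SD3Transformer2DModel

# "your-username/sd3.5-large-nf4" is a placeholder repository id, e.g. created
# earlier with model_nf4.push_to_hub("your-username/sd3.5-large-nf4").
model_nf4 = SD3Transformer2DModel.from_pretrained(
    "your-username/sd3.5-large-nf4",
    torch_dtype=torch.bfloat16,
)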
Training LoRAs with SD3.5 Large with quantization
Thanks to libraries like bitsandbytes and peft, it is possible to fine-tune large models like SD3.5 Large on consumer GPUs with 24GB of VRAM. It is already possible to leverage our existing SD3 training script for training LoRAs.
The training command below already works:
accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3.5-large" \
  --dataset_name="Norod78/Yarn-art-style" \
  --output_dir="yart_art_sd3-5_lora" \
  --mixed_precision="bf16" \
  --instance_prompt="Frog, yarn art style" \
  --caption_column="text" \
  --resolution=768 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=4e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=700 \
  --rank=16 \
  --seed="0" \
  --push_to_hub
However, to make it work with quantization, we need to tweak a couple of knobs. Below, we provide pointers on how to do that:
- We initialize the transformer either with a quantization config or load a quantized checkpoint directly.
- Then, we prepare it by using prepare_model_for_kbit_training() from peft.
- The rest of the process remains the same, thanks to peft’s strong support for bitsandbytes!
Refer to this example script for a fuller example.
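For reference, here is a minimal sketch of how the quantized transformer can be prepared for LoRA training (an illustration of the steps above rather than the exact code in the script; the LoRA rank and target modules are assumptions):
import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from peft import LoraConfig, prepare_model_for_kbit_training

# 1. Initialize the transformer with a quantization config (or load a quantized checkpoint).
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

# 2. Prepare the quantized model for k-bit training with peft.
transformer = prepare_model_for_kbit_training(transformer)

# 3. Attach LoRA adapters; the rank and target modules here are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
transformer.add_adapter(lora_config)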
Using single-file loading with the Stable Diffusion 3.5 Transformer
You can load the Stable Diffusion 3.5 Transformer model using the original checkpoint files published by Stability AI with the from_single_file method:
import torch
from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline
transformer = SD3Transformer2DModel.from_single_file(
"https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo/blob/most important/sd3.5_large.safetensors",
torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
"stabilityai/stable-diffusion-3.5-large",
transformer=transformer,
torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
image = pipe("a cat holding a sign that says hello world").images[0]
image.save("sd35.png")
Important links
Acknowledgements: Daniel Frank for the background photo used in the thumbnail of this blog post. Thanks to Pedro Cuenca and Tom Aarsen for their reviews on the post draft.