FLUX.2 is the latest series of image generation models from Black Forest Labs, preceded by the Flux.1 series. It's a completely new model with a new architecture and pre-training done from scratch!

In this post, we discuss the key changes introduced in FLUX.2, performing inference with it under various setups, and LoRA fine-tuning.
🚨 FLUX.2 is not meant to be a drop-in replacement for FLUX.1, but a new generation of model.
FLUX.2: A Brief Introduction
FLUX.2 can be used for both image-guided and text-guided image generation. Moreover, it can take multiple images as reference inputs while producing the final output image. Below, we briefly discuss the key changes introduced in FLUX.2.
Text encoder
First, instead of two text encoders as in Flux.1, it uses a single text encoder, Mistral Small 3.1. Using a single text encoder greatly simplifies the process of computing prompt embeddings. The pipeline allows for a `max_sequence_length` of 512.
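With a single tokenizer/text-encoder pair, prompt handling becomes a single call through the pipeline. Here is a minimal sketch (using the component layout shown later in this post; `max_sequence_length` is the 512-token cap mentioned above) that inspects the loaded text encoder and passes the cap explicitly:

```python
# Minimal sketch: the single text encoder is a Mistral Small 3.1 model served via transformers.
import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
print(type(pipe.text_encoder).__name__)  # Mistral3ForConditionalGeneration

image = pipe(
    prompt="a watercolor painting of a lighthouse at dawn",
    max_sequence_length=512,  # prompts are capped at 512 tokens
    num_inference_steps=50,
    guidance_scale=4,
).images[0]
```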
DiT
FLUX.2 follows the same general multimodal diffusion transformer (MM-DiT) + parallel DiT architecture as Flux.1. As a refresher, MM-DiT blocks first process the image latents and conditioning text in separate streams, only joining the two together for the attention operation, and are thus known as "double-stream" blocks. The parallel blocks then operate on the concatenated image and text streams and can be considered "single-stream" blocks.
The key DiT changes from Flux.1 to FLUX.2 are as follows:
- Time and guidance information (in the form of AdaLayerNorm-Zero modulation parameters) is shared across all double-stream and single-stream transformer blocks, respectively, rather than having individual modulation parameters for each block as in Flux.1.
- None of the layers in the model use `bias` parameters. In particular, neither the attention nor the feedforward (FF) sub-blocks of either transformer block use `bias` parameters in any of their layers.
- In Flux.1, the single-stream transformer blocks fused the attention output projection with the FF output projection. FLUX.2 single-stream blocks also fuse the attention QKV projections with the FF input projection, creating a fully parallel transformer block (see the sketch after this list):

Note that compared to the ViT-22B block depicted above, FLUX.2 uses a SwiGLU-style MLP activation rather than a GELU activation (and also doesn't use bias parameters).
- A larger proportion of the transformer blocks in FLUX.2 are single-stream blocks (8 double-stream blocks to 48 single-stream blocks, compared to 19/38 for Flux.1). This also means that single-stream blocks make up a larger proportion of the DiT parameters: Flux.1[dev]-12B has ~54% of its total parameters in the double-stream blocks, whereas FLUX.2[dev]-32B has ~24% of its parameters in the double-stream blocks (and ~73% in the single-stream blocks).
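To make the block-structure change concrete, here is an illustrative PyTorch sketch of a fully parallel "single-stream" block as referenced above: one fused input projection produces the attention QKV and the SwiGLU inputs, one fused output projection consumes the concatenated attention and MLP outputs, and no layer carries a bias. This is a simplified sketch for intuition only (it omits the shared AdaLayerNorm-Zero modulation, RoPE, and normalization details), not the actual FLUX.2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlockSketch(nn.Module):
    """Toy fully parallel block: fused QKV + FF input projection, SwiGLU MLP, no biases."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.mlp_hidden = int(dim * mlp_ratio)
        # One fused input projection: Q, K, V plus the SwiGLU gate/up projections.
        self.fused_in = nn.Linear(dim, 3 * dim + 2 * self.mlp_hidden, bias=False)
        # One fused output projection: attention output and FF down projection together.
        self.fused_out = nn.Linear(dim + self.mlp_hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v, gate, up = self.fused_in(x).split(
            [d, d, d, self.mlp_hidden, self.mlp_hidden], dim=-1
        )
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(b, n, d)
        mlp = F.silu(gate) * up  # SwiGLU-style activation
        return x + self.fused_out(torch.cat([attn, mlp], dim=-1))

block = ParallelBlockSketch(dim=128, num_heads=8)
print(block(torch.randn(1, 16, 128)).shape)  # torch.Size([1, 16, 128])
```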
Misc
- A new autoencoder
- A better way to incorporate resolution-dependent timestep schedules (the sketch below illustrates the general idea)
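As intuition for the second point, Flux-family pipelines shift the flow-matching timestep schedule based on how many image tokens are being denoised, so higher resolutions get a more aggressive shift. The sketch below shows this general recipe with the Flux.1 default constants; the exact FLUX.2 formulation and values may differ, and the token count assumes an 8x VAE downsampling with 2x2 patchification.

```python
# Hedged sketch of resolution-dependent timestep shifting (Flux.1-style defaults).
import math

def resolution_dependent_shift(image_seq_len: int,
                               base_seq_len: int = 256, max_seq_len: int = 4096,
                               base_shift: float = 0.5, max_shift: float = 1.15) -> float:
    # Linearly interpolate the shift "mu" between a base and a max number of image tokens.
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    return image_seq_len * slope + (base_shift - slope * base_seq_len)

def shift_sigma(sigma: float, mu: float) -> float:
    # Time shift used by flow-matching schedulers: sigma' = e^mu / (e^mu + (1/sigma - 1)).
    return math.exp(mu) / (math.exp(mu) + (1 / sigma - 1))

# 1024x1024 image -> (1024 / 8 / 2)^2 = 4096 image tokens (assumed VAE/patch factors).
mu = resolution_dependent_shift(image_seq_len=(1024 // 16) ** 2)
print([round(shift_sigma(s, mu), 3) for s in (0.25, 0.5, 0.75, 1.0)])
```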
Inference With Diffusers
FLUX.2 uses a larger DiT and Mistral Small 3.1 as its text encoder. When used together without any form of offloading, inference takes more than 80GB of VRAM. In the following sections, we show how to perform inference with FLUX.2 in more accessible ways, under various system-level constraints.
Installation and Authentication
Before you try out the following code snippets, make sure you have installed `diffusers` from `main` and have run `hf auth login`.
pip uninstall diffusers -y && pip install git+https://github.com/huggingface/diffusers -U
Regular Inference
from diffusers import Flux2Pipeline
import torch
repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
image = pipe(
prompt="dog dancing near the sun",
num_inference_steps=50,
guidance_scale=4,
height=1024,
width=1024
).images[0]
The above code snippet was tested on an H100, which isn't sufficient to run inference without CPU offloading. With CPU offloading enabled, this setup takes ~62GB of VRAM to run.
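If you want to verify the memory footprint on your own hardware, a quick way (assuming a CUDA device; numbers will vary with your setup) to measure the peak VRAM of the snippet above is:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the pipeline call from the snippet above ...
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```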
Users who have access to Hopper-series GPUs can take advantage of Flash Attention 3 to speed up inference:
from diffusers import Flux2Pipeline
import torch
repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
pipe.transformer.set_attention_backend("_flash_3_hub")
pipe.enable_model_cpu_offload()
image = pipe(
prompt="dog dancing near the sun",
num_inference_steps=50,
guidance_scale=2.5,
height=1024,
width=1024
).images[0]
You can check out the supported attention backends (we have many!) here.
Resource-constrained
Using 4-bit quantization
Using bitsandbytes, we can load the transformer and text encoder models in 4-bit, allowing owners of 24GB GPUs to use the model locally. You can run this snippet on a GPU with ~20GB of free VRAM.
import torch
from transformers import Mistral3ForConditionalGeneration
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
device = "cuda:0"
torch_dtype = torch.bfloat16
transformer = Flux2Transformer2DModel.from_pretrained(
repo_id, subfolder="transformer", torch_dtype=torch_dtype, device_map="cpu"
)
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
repo_id, subfolder="text_encoder", dtype=torch_dtype, device_map="cpu"
)
pipe = Flux2Pipeline.from_pretrained(
repo_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)
pipe.enable_model_cpu_offload()
prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colours, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves within the background. The can has the text `BFL Diffusers` on it and it has a color gradient that start with #FF5733 at the highest and transitions to #33FF57 at the underside."
image = pipe(
prompt=prompt,
generator=torch.Generator(device=device).manual_seed(42),
num_inference_steps=50,
guidance_scale=4,
).images[0]
image.save("flux2_t2i_nf4.png")
Local + remote
Thanks to the modular design of Diffusers pipelines, we can isolate modules and work with them in sequence. Here, we decouple the text encoder and deploy it to an Inference Endpoint. This frees up local VRAM, which is then used for the DiT and VAE only.
⚠️ To use the remote text encoder, you need a valid token. If you are already authenticated, no further action is required.
The example below uses a mixture of local and remote inference. Moreover, we quantize the DiT with NF4 quantization through bitsandbytes.
You can run this snippet on a GPU with 18GB of VRAM:
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from diffusers import BitsAndBytesConfig as DiffBitsAndBytesConfig
from huggingface_hub import get_token
import requests
import torch
import io
def remote_text_encoder(prompts: str | list[str]):
    def _encode_single(prompt: str):
        response = requests.post(
            "https://remote-text-encoder-flux-2.huggingface.co/predict",
            json={"prompt": prompt},
            headers={
                "Authorization": f"Bearer {get_token()}",
                "Content-Type": "application/json"
            }
        )
        assert response.status_code == 200, f"{response.status_code=}"
        return torch.load(io.BytesIO(response.content))

    if isinstance(prompts, (list, tuple)):
        embeds = [_encode_single(p) for p in prompts]
        return torch.cat(embeds, dim=0)
    return _encode_single(prompts).to("cuda")
repo_id = "black-forest-labs/FLUX.2-dev"
quantized_dit_id = "diffusers/FLUX.2-dev-bnb-4bit"
dit = Flux2Transformer2DModel.from_pretrained(
    quantized_dit_id, subfolder="transformer", torch_dtype=torch.bfloat16, device_map="cpu"
)
pipe = Flux2Pipeline.from_pretrained(
repo_id,
text_encoder=None,
transformer=dit,
torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
print("Running distant text encoder ☁️")
prompt1 = "a photograph of a forest with mist swirling across the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture"
prompt2 = "a photograph of a dense forest with rain. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture"
prompt_embeds = remote_text_encoder([prompt1, prompt2])
print("Done ✅")
out = pipe(
prompt_embeds=prompt_embeds,
generator=torch.Generator(device="cuda").manual_seed(42),
num_inference_steps=50,
guidance_scale=4,
height=1024,
width=1024,
)
for idx, image in enumerate(out.images):
image.save(f"flux_out_{idx}.png")
For GPUs with even lower VRAM, we have group offloading, which allows GPUs with as little as 8GB of free VRAM to use this model. However, you'll need 32GB of free system RAM. Alternatively, if you're willing to sacrifice some speed, you can set low_cpu_mem_usage=True to reduce the RAM requirement to only 10GB.
import io
import os
import requests
import torch
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
torch_dtype = torch.bfloat16
device = "cuda"
def remote_text_encoder(prompts: str | list[str]):
    def _encode_single(prompt: str):
        response = requests.post(
            "https://remote-text-encoder-flux-2.huggingface.co/predict",
            json={"prompt": prompt},
            headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}", "Content-Type": "application/json"},
        )
        assert response.status_code == 200, f"{response.status_code=}"
        return torch.load(io.BytesIO(response.content))

    if isinstance(prompts, (list, tuple)):
        embeds = [_encode_single(p) for p in prompts]
        return torch.cat(embeds, dim=0)
    return _encode_single(prompts).to("cuda")
transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer", torch_dtype=torch_dtype, device_map="cpu"
)
pipe = Flux2Pipeline.from_pretrained(
repo_id,
text_encoder=None,
transformer=transformer,
torch_dtype=torch_dtype,
)
pipe.transformer.enable_group_offload(
onload_device=device,
offload_device="cpu",
offload_type="leaf_level",
use_stream=True,
)
pipe.to(device)
prompt = "a photograph of a forest with mist swirling across the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture"
prompt_embeds = remote_text_encoder(prompt)
image = pipe(
prompt_embeds=prompt_embeds,
generator=torch.Generator(device=device).manual_seed(42),
num_inference_steps=50,
guidance_scale=4,
height=1024,
width=1024,
).images[0]
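To use the lower-RAM path mentioned above, replace the enable_group_offload call in the snippet with a variant that sets low_cpu_mem_usage, trading some speed for system RAM:

```python
pipe.transformer.enable_group_offload(
    onload_device=device,
    offload_device="cpu",
    offload_type="leaf_level",
    use_stream=True,
    low_cpu_mem_usage=True,  # lowers the RAM requirement at the cost of some speed
)
```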
You can check out other supported quantization backends here and other memory-saving techniques here.
To check how different quantizations affect an image, you can play with the playground below or access it as a standalone demo in the FLUX.2 Quantization experiments Space.
Multiple images as reference
FLUX.2 supports using multiple images as inputs, allowing you to use up to 10 images. However, keep in mind that each additional image will require more VRAM. You can reference the images by index (e.g., image 1, image 2) or by natural language (e.g., the kangaroo, the turtle). For optimal results, the best approach is to use a combination of both methods.
import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image
repo_id = "diffusers-internal-dev/new-model-image-final-weights"
device = "cuda:0"
torch_dtype = torch.bfloat16
pipe = Flux2Pipeline.from_pretrained(
repo_id, torch_dtype=torch_dtype
)
pipe.enable_model_cpu_offload()
image_one = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/flux2_blog/kangaroo.png")
image_two = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/flux2_blog/turtle.png")
prompt = "the boxer kangaroo from image 1 and the martial artist turtle from image 2 are fighting in an epic battle scene at a beach of a tropical island, 35mm, depth of field, 50mm lens, f/3.5, cinematic lighting"
image = pipe(
prompt=prompt,
image=[image_one, image_two],
generator=torch.Generator(device=device).manual_seed(42),
num_inference_steps=50,
guidance_scale=2.5,
width=1024,
height=768,
).images[0]
image.save(f"./flux2_t2i.png")
LoRA fine-tuning
Being both a text-to-image and an image-to-image model, FLUX.2 makes the perfect fine-tuning candidate for many use cases! However, as inference alone takes more than 80GB of VRAM, LoRA fine-tuning is even tougher to run on consumer GPUs. To squeeze out as much memory saving as we can, we utilize some of the inference optimizations described above for training as well, along with common memory-saving techniques, to substantially reduce memory consumption. To train it, you can use either the diffusers code below or Ostris' AI Toolkit.
We provide both text-to-image and image-to-image training scripts; for the purpose of this blog, we will focus on a text-to-image training example.
Memory optimizations for fine-tuning
Many of these techniques complement one another and can be used together to reduce memory consumption further. However, some techniques may be mutually exclusive, so remember to check before launching a training run.
Here are the details on the memory-saving techniques used:
- **Remote Text Encoder:** to leverage remote text encoding for training, simply pass `--remote_text_encoder`. Note that you need to either be logged in to your Hugging Face account (`hf auth login`) OR pass a token with `--hub_token`.
- **CPU Offloading:** by passing `--offload`, the VAE and text encoder will be offloaded to CPU memory and only moved to the GPU when needed.
- **Latent Caching:** pre-encode the training images with the VAE, and then delete it to free up some memory. To enable latent caching, simply pass `--cache_latents`.
- **QLoRA:** low-precision training with quantization, using 8-bit or 4-bit weights. You can use the following flags:
  - **FP8 training** with `torchao`: enable FP8 training by passing `--do_fp8_training`.
    > [!IMPORTANT]
    > Since we're utilizing FP8 tensor cores, we need CUDA GPUs with a compute capability of at least 8.9. If you're looking for memory-efficient training on relatively older cards, we encourage you to check out other trainers like `SimpleTuner`, `ai-toolkit`, etc.
  - **NF4 training** with `bitsandbytes`: alternatively, you can use 8-bit or 4-bit quantization with `bitsandbytes` by passing `--bnb_quantization_config_path` with the path to a JSON file containing your config. See below for more details.
- **Gradient Checkpointing and Accumulation:** `--gradient_accumulation_steps` refers to the number of update steps to accumulate before performing a backward/update pass. By passing a value > 1, you can reduce the number of backward/update passes and hence also the memory requirements. With `--gradient_checkpointing`, we can save memory by not storing all intermediate activations during the forward pass. Instead, only a subset of those activations (the checkpoints) is stored, and the rest is recomputed as needed during the backward pass. Note that this comes at the expense of a slower backward pass.
- **8-bit Adam Optimizer:** when training with `AdamW` (doesn't apply to `prodigy`), you can pass `--use_8bit_adam` to reduce the memory requirements of training. Make sure to install `bitsandbytes` if you want to do so.
Let’s launch a training run using these memory-saving optimizations.
Please make sure to check out the README for prerequisites before starting training.
For this example, we'll use the multimodalart/1920-raider-waite-tarot-public-domain dataset with the following configuration using FP8 training. Feel free to experiment more with the hyperparameters and share your results 🤗
accelerate launch train_dreambooth_lora_flux2.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.2-dev" \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --remote_text_encoder \
  --cache_latents \
  --caption_column="caption" \
  --do_fp8_training \
  --dataset_name="multimodalart/1920-raider-waite-tarot-public-domain" \
  --output_dir="tarot_card_Flux2_LoRA" \
  --instance_prompt="trcrd tarot card" \
  --resolution=1024 \
  --train_batch_size=2 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=1 \
  --optimizer="adamW" \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=200 \
  --checkpointing_steps=250 \
  --max_train_steps=1000 \
  --rank=8 \
  --validation_prompt="a trtcrd of a person on a computer, on the computer you see a meme being made with an ancient looking trollface, 'the shitposter' arcana, in the style of TOK a trtcrd, tarot style" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub
The left image was generated using the pre-trained FLUX.2 model, and the right image was produced by the LoRA.
If your hardware isn't compatible with FP8 training, you can use QLoRA with bitsandbytes. You first need to define a config.json file like so:
{
"load_in_4bit": true,
"bnb_4bit_quant_type": "nf4"
}
And then pass its path to `--bnb_quantization_config_path`:
accelerate launch train_dreambooth_lora_flux2.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.2-dev" \
  --mixed_precision="bf16" \
  --gradient_checkpointing \
  --remote_text_encoder \
  --cache_latents \
  --caption_column="caption" \
  --bnb_quantization_config_path="config.json" \
  --dataset_name="multimodalart/1920-raider-waite-tarot-public-domain" \
  --output_dir="tarot_card_Flux2_LoRA" \
  --instance_prompt="a tarot card" \
  --resolution=1024 \
  --train_batch_size=2 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=1 \
  --optimizer="adamW" \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=200 \
  --max_train_steps=1000 \
  --rank=8 \
  --validation_prompt="a trtcrd of a person on a computer, on the computer you see a meme being made with an ancient looking trollface, 'the shitposter' arcana, in the style of TOK a trtcrd, tarot style" \
  --seed="0"

