In 2025, NVIDIA partnered with Black Forest Labs (BFL) to optimize the FLUX.1 text-to-image model series, unlocking FP4 image generation performance on NVIDIA Blackwell GeForce RTX 50 Series GPUs.
As a natural extension of the latent diffusion model, FLUX.1 Kontext [dev] proved that in-context learning is a feasible technique for visual-generation models, not only large language models (LLMs). To make this experience more widely accessible, NVIDIA collaborated with BFL to enable a near real-time editing experience using low-precision quantization.
FLUX.2 is a significant breakthrough, offering the general public multi-image references and quality comparable to the very best enterprise models. However, because FLUX.2 [dev] requires substantial compute resources, BFL, Comfy Org, and NVIDIA collaborated to achieve a major advance: reducing the FLUX.2 [dev] memory requirement by more than 40% and enabling local deployment through ComfyUI. This optimization, using FP8 precision, has made FLUX.2 [dev] one of the most popular models in the image-generation space.
With FLUX.2 [dev] established as the gold standard for open-weight models, the NVIDIA team, in collaboration with BFL, is now excited to share the next leap in performance: 4-bit acceleration for FLUX.2 [dev] on the most powerful data center NVIDIA Blackwell GPUs, including NVIDIA DGX B200 and NVIDIA DGX B300.
This post walks through the various inference optimization techniques the team used to speed up FLUX.2 [dev] on these NVIDIA data center architectures, including code snippets and steps to get started. The combined effect of these optimizations is a remarkable reduction in latency, enabling efficient deployment on data center GPUs.
Visual comparison between BF16 and NVFP4 with FLUX.2 [dev]
Before diving into the specifics, compare the output quality of FLUX.2 [dev] at the default BF16 precision with the remarkably similar results achieved using NVFP4 (Figures 1 and 2).
The prompt for Figure 1 is, “A cat naps peacefully on a comfortable sofa. The sofa is perched atop a tall tree that grows from the surface of the moon. Earth hangs in the distance, a vibrant blue and green jewel in the darkness of space. A sleek spaceship hovers nearby, casting a soft light on the scene, while the entire digital art composition exudes a dreamlike quality.”


The prompt for Figure 2 is, “An oil painting of a couple in formal evening wear going home are caught in a heavy downpour with no umbrellas.” In this case, discrepancies are harder to identify. The most prominent are the smile of the gentleman in the BF16 image and the multiple umbrellas in the background of the NVFP4 image. Aside from these, the majority of fine details are preserved in both the foreground and background of both images.


Optimizing FLUX.2 [dev]
The FLUX.2 [dev] model consists of three key components: a text embedding model (specifically Mistral Small 3), the diffusion transformer model, and an autoencoder. The NVIDIA team applied the following optimization techniques to the open source diffusers implementation, using a prototype runtime staged in the TensorRT-LLM/feat/visual_gen branch:
- NVFP4 quantization
- Timestep Embedding Aware Caching (TeaCache)
- CUDA Graphs
- Torch compile
- Multi-GPU inferencing
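As a reference point before applying any of these techniques, the unoptimized baseline can be expressed as a standard diffusers pipeline load. The checkpoint ID below is an assumption based on the public Hugging Face naming convention, so treat this as a minimal sketch rather than the exact benchmark script.
import torch
from diffusers import DiffusionPipeline

# Assumed checkpoint ID; DiffusionPipeline resolves the concrete FLUX.2
# pipeline class from the model repository.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# 50 diffusion steps, matching the benchmark setting used later in this post.
image = pipe(
    "A cat naps peacefully on a comfortable sofa.",
    num_inference_steps=50,
).images[0]
image.save("flux2_bf16_baseline.png")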
NVFP4 quantization
NVFP4 advances the concept of microscaling data formats by introducing a two-level microblock scaling strategy. This approach is designed to minimize accuracy degradation and features two distinct mechanisms: per-tensor scaling and per-block scaling.
Per-tensor scaling is a value stored in FP32 precision that adjusts the overall tensor distribution and can be computed statically or dynamically. In contrast, per-block scaling is calculated dynamically in real time by dividing the tensor into blocks of 16 elements.
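The following PyTorch sketch emulates the numerics of this two-level scheme for illustration only; it is not the actual NVFP4 kernel, and the rounding grid is a simplified stand-in for real E2M1 rounding.
import torch

FP4_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_MAX = 448.0  # largest magnitude representable in FP8 (E4M3)

def nvfp4_emulate(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Emulate NVFP4 two-level scaling numerics (illustrative only).
    Assumes x.numel() is divisible by block_size."""
    # Level 1: per-tensor FP32 scale adjusts the overall distribution so that
    # the per-block scales fit into the FP8 range.
    tensor_scale = x.abs().max().clamp_min(1e-12) / (FP4_MAX * FP8_MAX)
    x_scaled = x / tensor_scale

    # Level 2: per-block scale, computed dynamically over blocks of 16 elements
    # (in hardware, this scale is itself stored in FP8 E4M3).
    blocks = x_scaled.reshape(-1, block_size)
    block_scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / FP4_MAX

    # Round block values onto the FP4 (E2M1) grid of representable magnitudes.
    fp4_grid = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], device=x.device)
    normalized = blocks / block_scale
    idx = (normalized.abs().unsqueeze(-1) - fp4_grid).abs().argmin(dim=-1)
    q = fp4_grid[idx] * normalized.sign()

    # Dequantize by applying both scaling levels in reverse.
    return (q * block_scale * tensor_scale).reshape(x.shape).to(x.dtype)
Running this on a weight tensor gives a quick sense of the quantization error the format introduces before switching to the real low-precision kernels.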
For maximum flexibility, users can choose to retain specific layers in higher precision and apply dynamic quantization, as shown in the following example using FLUX.2 [dev]:
exclude_pattern = r"^(?!.*(embedder|norm_out|proj_out|to_add_out|to_added_qkv|stream)).*"
NVFP4 computation is then applied using the following statement:
from visual_gen.layers import apply_visual_gen_linear

apply_visual_gen_linear(
    model,
    load_parameters=True,
    quantize_weights=True,
    exclude_pattern=exclude_pattern,
)
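To see what the negative-lookahead pattern selects, here is a small illustrative check; the module names are hypothetical stand-ins for typical diffusion transformer layer names.
import re

# Names containing any excluded keyword do not match the pattern.
example_names = [
    "transformer_blocks.0.attn.to_q",              # no keyword: matches
    "time_text_embed.timestep_embedder.linear_1",  # "embedder": no match
    "norm_out.linear",                             # "norm_out": no match
]
for name in example_names:
    print(f"{name}: {bool(re.match(exclude_pattern, name))}")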
TeaCache
The TeaCache technique is used to speed up the inference process. TeaCache conditionally skips a diffusion step by reusing the previous latent generated during the diffusion process. To quantify this effect, the team ran tests with 20 prompts and a 50-step inference process: TeaCache bypassed an average of 16 steps, resulting in an approximate 30% reduction in inference latency.
To determine the optimal TeaCache hyperparameters, a grid search was employed. The following configuration yields the best balance between computational speed and generation quality.
dit_configs = {
    ...
    "teacache": {
        "enable_teacache": True,
        "use_ret_steps": True,
        "teacache_thresh": 0.05,
        "ret_steps": 10,
        "cutoff_steps": 50,
    },
    ...
}
The scaling factor for the caching mechanism was determined empirically and approximated through a third-degree polynomial. This polynomial was fitted using a calibration dataset comprising text-to-image and multireference-image-generation examples.
Figure 3 illustrates this empirical approach, plotting the raw calibration data points alongside the resulting third-degree polynomial curve (shown in red) that models the relationship between the modulated input difference and the model's output difference.
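A minimal sketch of the resulting skip decision is shown below. The polynomial coefficients are placeholders, and the accumulation logic follows the published TeaCache approach rather than the exact visual_gen implementation.
import numpy as np

# Placeholder third-degree polynomial (highest-degree coefficient first)
# standing in for the curve fitted on the calibration dataset.
rescale_poly = np.poly1d([120.0, -18.0, 1.2, 0.01])

accumulated_diff = 0.0
threshold = 0.05  # matches teacache_thresh in the configuration above

def should_skip_step(modulated_input, prev_modulated_input):
    """Return True when the cached residual from the last full step can be reused."""
    global accumulated_diff
    rel_diff = (
        (modulated_input - prev_modulated_input).abs().mean()
        / prev_modulated_input.abs().mean()
    ).item()
    # Rescale the modulated input difference to an estimated output difference.
    accumulated_diff += rescale_poly(rel_diff)
    if accumulated_diff < threshold:
        return True
    accumulated_diff = 0.0  # run the full transformer and reset the accumulator
    return False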


CUDA Graphs
NVIDIA TensorRT-LLM visual_gen provides a ready-made wrapper to support CUDA Graphs capture. Simply import the wrapper and replace the forward function:
from visual_gen.utils.cudagraph import cudagraph_wrapper
model.forward = cudagraph_wrapper(model.forward)
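Under the hood, the wrapper follows the standard PyTorch capture-and-replay pattern. The simplified sketch below illustrates that pattern for a fixed-shape forward pass; it is not the actual visual_gen implementation.
import torch

def capture_forward(forward_fn, *static_inputs):
    """Simplified capture-and-replay pattern for a fixed-shape forward pass."""
    # Warm up on a side stream so lazy initialization is not captured.
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        for _ in range(3):
            static_output = forward_fn(*static_inputs)
    torch.cuda.current_stream().wait_stream(stream)

    # Capture the forward pass into a CUDA graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = forward_fn(*static_inputs)

    def replay(*inputs):
        # Copy new data into the captured (static) input tensors, then replay.
        for static_in, new_in in zip(static_inputs, inputs):
            static_in.copy_(new_in)
        graph.replay()
        return static_output

    return replay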
Torch compile
In all of the team's experiments, torch.compile was enabled except for the baseline run, as it is not enabled in FLUX.2 [dev] by default.
model = torch.compile(model)
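Depending on the model and GPU, more aggressive compile settings may be worth trying; they increase warmup time, and whether they help FLUX.2 [dev] should be verified empirically.
# Optional: more aggressive autotuning at the cost of a longer warmup.
model = torch.compile(model, mode="max-autotune", fullgraph=False)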
Multi-GPU support
Enabling multiple GPUs using TensorRT-LLM visual_gen involves four steps:
- Modify the model.forward function to insert code handling inter-GPU communication
- Replace the attention implementation in your model with ditAttnProcessor
- Select a parallel algorithm and set the parallelism size in the config
- Launch with torchrun
The following snippet provides an example. Insert the split code at the beginning of model.forward to distribute input data across multiple GPUs:
from visual_gen.utils import (
    dit_sp_gather,
    dit_sp_split,
)

# ...
hidden_states = dit_sp_split(hidden_states, dim=1)
encoder_hidden_states = dit_sp_split(encoder_hidden_states, dim=1)
img_ids = dit_sp_split(img_ids, dim=1)
txt_ids = dit_sp_split(txt_ids, dim=1)
Next, insert the gather code at the end of model.forward, before the return statement:
output = dit_sp_gather(output, dim=1)
Then replace the original attention implementation with the provided attention processor, which ensures proper communication across multiple GPUs:
from visual_gen.layers import ditAttnProcessor

# ...
def attention(...):
    # ...
    x = ditAttnProcessor().visual_gen_attn(q, k, v, tensor_layout="HND")
    # ...
Set the correct parallel size in the configuration. For example, to use Ulysses parallelism on four GPUs:
dit_configs = {
    ...
    "parallel": {
        "dit_ulysses_size": 4,
    },
    ...
}
Finally, call the setup_configs API to activate the configs:
visual_gen.setup_configs(**dit_configs)
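Putting the pieces together, a combined configuration that enables both TeaCache and four-way Ulysses parallelism can be activated in a single call; only keys already shown in this post are used here.
import visual_gen

dit_configs = {
    "teacache": {
        "enable_teacache": True,
        "use_ret_steps": True,
        "teacache_thresh": 0.05,
        "ret_steps": 10,
        "cutoff_steps": 50,
    },
    "parallel": {
        "dit_ulysses_size": 4,
    },
}
visual_gen.setup_configs(**dit_configs)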
When using multiple GPUs, the script must be launched with torchrun. TensorRT-LLM visual_gen will use the rank information from torchrun and handle all communication and job splitting correctly.
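For reference, torchrun exposes the rank information through standard environment variables. A minimal entry-script skeleton might start like this; the script name in the comment is hypothetical.
# Launch with, for example:
#   torchrun --nproc_per_node=4 run_flux2_visual_gen.py
import os

import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
print(f"Initialized rank {local_rank} of {world_size} GPUs")
# TensorRT-LLM visual_gen reads the same rank information to split the work
# across GPUs and set up the required communication.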
Performance evaluation
All of the inference optimizations described in this post (low-precision kernels, the caching technique, and multi-GPU inference) have been included in an end-to-end FLUX.2 [dev] example.
As shown in Figure 4, the NVIDIA DGX B200 architecture delivers a 1.7x generational leap over NVIDIA H200, even when using default BF16 precision. Further, the layered application of inference optimizations—including CUDA Graphs, torch.compile, NVFP4 precision, and TeaCache—incrementally boosts single-B200 performance from that baseline to a considerable 6.3x speedup.
Ultimately, multi-GPU inference on a two-B200 configuration achieves a 10.2x performance increase compared to the H200, the current industry standard.


The baseline is the original FLUX.2 [dev] without any optimization and without torch.compile enabled. The optimized series includes enabling torch.compile, CUDA Graphs, NVFP4, and TeaCache. The number of diffusion steps used in the benchmarks was 50.
On a single GPU, the team found that NVFP4 and TeaCache provide a good tradeoff between speedup and output quality, delivering roughly 2x speedups each. torch.compile is a near-lossless acceleration technique that most developers are familiar with, but its benefits here are limited. CUDA Graphs are most helpful for multi-GPU inference, unlocking incremental scaling across multiple GPUs on NVIDIA B200. Finally, the overall pipeline proves robust with FP8 quantization of the text encoder, providing additional benefits for large-scale deployments.
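As a conceptual illustration of FP8 weight quantization (not the exact recipe used for the Mistral Small 3 text encoder), per-tensor E4M3 casting can be emulated in plain PyTorch:
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def quantize_weight_fp8(weight: torch.Tensor):
    """Per-tensor FP8 weight quantization (conceptual emulation)."""
    scale = weight.abs().max().clamp_min(1e-12) / FP8_E4M3_MAX
    return (weight / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(weight_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Restore an approximation of the original weight for reference checks.
    return weight_fp8.to(torch.float32) * scale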
On multiple GPUs, the TensorRT-LLM visual_gen sequence parallelism delivers near-linear scaling as more GPUs are added. The same effect is observed on NVIDIA Blackwell B200 and GB200 GPUs as well as NVIDIA Blackwell Ultra B300 and GB300 GPUs. Additional optimizations are in progress for NVIDIA Blackwell Ultra GPUs.


Get started with FLUX.2 on NVIDIA Blackwell GPUs
FLUX.2 is a significant advancement in image generation, successfully combining high-quality outputs with user-friendly deployment options. The NVIDIA team, in collaboration with BFL, achieved substantial acceleration for FLUX.2 [dev] on the most powerful NVIDIA data center GPUs.
Applying these new techniques to the FLUX.2 [dev] model, including NVFP4 quantization and TeaCache, delivers a strong generational leap in inference speed. The combined effect of these optimizations is a remarkable reduction in latency, enabling efficient deployment on NVIDIA data center GPUs.
To start building your own inference pipeline with these state-of-the-art optimizations, check out the end-to-end FLUX.2 example and accompanying code in the NVIDIA/TensorRT-LLM/visual_gen GitHub repo.
