In 2025, NVIDIA partnered with Black Forest Labs (BFL) to optimize the FLUX.1 text-to-image model series, unlocking FP4 image generation performance on NVIDIA Blackwell GeForce RTX 50 Series GPUs.
As a natural extension of the latent diffusion model, FLUX.1 Kontext [dev] proved that in-context learning is a feasible technique for visual-generation models, not only large language models (LLMs). To make this experience more widely accessible, NVIDIA collaborated with BFL to enable a near real-time editing experience using low-precision quantization.
FLUX.2 is a significant breakthrough, offering the general public multi-image references and quality comparable to the very best enterprise models. However, because FLUX.2 [dev] requires substantial compute resources, BFL, Comfy Org, and NVIDIA collaborated to achieve a major advance: reducing the FLUX.2 [dev] memory requirement by more than 40% and enabling local deployment through ComfyUI. This optimization, using FP8 precision, has made FLUX.2 [dev] one of the most popular models in the image-generation space.
With FLUX.2 [dev] established as the gold standard for open-weight models, the NVIDIA team, in collaboration with BFL, is now excited to share the next leap in performance: 4-bit acceleration for FLUX.2 [dev] on the most powerful data center NVIDIA Blackwell GPUs, including NVIDIA DGX B200 and NVIDIA DGX B300.
This post walks through the various inference optimization techniques the team used to speed up FLUX.2 [dev] on these NVIDIA data center architectures, including code snippets and steps to get started. The combined effect of these optimizations is a remarkable reduction in latency, enabling efficient deployment on data center GPUs.
Visual comparison between BF16 and NVFP4 with FLUX.2 [dev]
Before diving into the specifics, compare the output quality of FLUX.2 [dev] at the default BF16 precision with the remarkably similar results achieved using NVFP4 (Figures 1 and 2).
The prompt for Figure 1 is, “A cat naps peacefully on a comfortable sofa. The sofa is perched atop a tall tree that grows from the surface of the moon. Earth hangs in the distance, a vibrant blue and green jewel in the darkness of space. A sleek spaceship hovers nearby, casting a soft light on the scene, while the entire digital art composition exudes a dreamlike quality.”


The prompt for Figure 2 is, “An oil painting of a couple in formal evening wear going home are caught in a heavy downpour with no umbrellas.” In this case, discrepancies are harder to identify. The most prominent are the smile of the gentleman in the BF16 image and the multiple umbrellas in the background of the NVFP4 image. Aside from these, the majority of fine details are preserved in both the foreground and background of both images.


Optimizing FLUX.2 [dev]
The FLUX.2 [dev] model consists of three key components: a text embedding model (specifically Mistral Small 3), the diffusion transformer model, and an autoencoder. The NVIDIA team applied the following optimization techniques to the open source diffusers implementation, using a prototype runtime staged in the TensorRT-LLM/feat/visual_gen branch:
- NVFP4 quantization
- Timestep Embedding Aware Caching (TeaCache)
- CUDA Graphs
- Torch compile
- Multi-GPU inferencing
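As a reference point before applying any of these techniques, the unoptimized baseline can be expressed as a standard diffusers pipeline load. The checkpoint ID below is an assumption based on the public Hugging Face naming convention, so treat this as a minimal sketch rather than the exact benchmark script.
import torch
from diffusers import DiffusionPipeline

# Assumed checkpoint ID; DiffusionPipeline resolves the concrete FLUX.2
# pipeline class from the model repository.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# 50 diffusion steps, matching the benchmark setting used later in this post.
image = pipe(
    "A cat naps peacefully on a comfortable sofa.",
    num_inference_steps=50,
).images[0]
image.save("flux2_bf16_baseline.png")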
NVFP4 quantization
NVFP4 advances the concept of microscaling data formats by introducing a two-level microblock scaling strategy. This approach is designed to minimize accuracy degradation and features two distinct mechanisms: per-tensor scaling and per-block scaling.
Per-tensor scaling is a value stored in FP32 precision that adjusts the overall tensor distribution and can be computed statically or dynamically. In contrast, per-block scaling is calculated dynamically in real time by dividing the tensor into blocks of 16 elements.
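The following PyTorch sketch emulates the numerics of this two-level scheme for illustration only; it is not the actual NVFP4 kernel, and the rounding grid is a simplified stand-in for real E2M1 rounding.
import torch

FP4_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_MAX = 448.0  # largest magnitude representable in FP8 (E4M3)

def nvfp4_emulate(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Emulate NVFP4 two-level scaling numerics (illustrative only).
    Assumes x.numel() is divisible by block_size."""
    # Level 1: per-tensor FP32 scale adjusts the overall distribution so that
    # the per-block scales fit into the FP8 range.
    tensor_scale = x.abs().max().clamp_min(1e-12) / (FP4_MAX * FP8_MAX)
    x_scaled = x / tensor_scale

    # Level 2: per-block scale, computed dynamically over blocks of 16 elements
    # (in hardware, this scale is itself stored in FP8 E4M3).
    blocks = x_scaled.reshape(-1, block_size)
    block_scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / FP4_MAX

    # Round block values onto the FP4 (E2M1) grid of representable magnitudes.
    fp4_grid = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], device=x.device)
    normalized = blocks / block_scale
    idx = (normalized.abs().unsqueeze(-1) - fp4_grid).abs().argmin(dim=-1)
    q = fp4_grid[idx] * normalized.sign()

    # Dequantize by applying both scaling levels in reverse.
    return (q * block_scale * tensor_scale).reshape(x.shape).to(x.dtype)
Running this on a weight tensor gives a quick sense of the quantization error the format introduces before switching to the real low-precision kernels.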
For maximum flexibility, users can choose to retain specific layers in higher precision and apply dynamic quantization, as shown in the following example using FLUX.2 [dev]:
exclude_pattern = r"^(?!.*(embedder|norm_out|proj_out|to_add_out|to_added_qkv|stream)).*"
NVFP4 computation is then applied using the following statement:
from visual_gen.layers import apply_visual_gen_linear

apply_visual_gen_linear(
    model,
    load_parameters=True,
    quantize_weights=True,
    exclude_pattern=exclude_pattern,
)
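To see what the negative-lookahead pattern selects, here is a small illustrative check; the module names are hypothetical stand-ins for typical diffusion transformer layer names.
import re

# Names containing any excluded keyword do not match the pattern.
example_names = [
    "transformer_blocks.0.attn.to_q",              # no keyword: matches
    "time_text_embed.timestep_embedder.linear_1",  # "embedder": no match
    "norm_out.linear",                             # "norm_out": no match
]
for name in example_names:
    print(f"{name}: {bool(re.match(exclude_pattern, name))}")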
TeaCache
The TeaCache technique is used to speed up the inference process. TeaCache conditionally skips a diffusion step by reusing the previous latent generated during the diffusion process. To quantify this effect, the team ran tests with 20 prompts and a 50-step inference process: TeaCache bypassed an average of 16 steps, resulting in an approximate 30% reduction in inference latency.
To determine the optimal TeaCache hyperparameters, a grid search was employed. The following configuration yields the best balance between computational speed and generation quality.
dit_configs = {
    ...
    "teacache": {
        "enable_teacache": True,
        "use_ret_steps": True,
        "teacache_thresh": 0.05,
        "ret_steps": 10,
        "cutoff_steps": 50,
    },
    ...
}
The scaling factor for the caching mechanism was determined empirically and approximated through a third-degree polynomial. This polynomial was fitted using a calibration dataset comprising text-to-image and multireference-image-generation examples.
Figure 3 illustrates this empirical approach, plotting the raw calibration data points alongside the resulting third-degree polynomial curve (shown in red) that models the relationship between the modulated input difference and the model's output difference.
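A minimal sketch of the resulting skip decision is shown below. The polynomial coefficients are placeholders, and the accumulation logic follows the published TeaCache approach rather than the exact visual_gen implementation.
import numpy as np

# Placeholder third-degree polynomial (highest-degree coefficient first)
# standing in for the curve fitted on the calibration dataset.
rescale_poly = np.poly1d([120.0, -18.0, 1.2, 0.01])

accumulated_diff = 0.0
threshold = 0.05  # matches teacache_thresh in the configuration above

def should_skip_step(modulated_input, prev_modulated_input):
    """Return True when the cached residual from the last full step can be reused."""
    global accumulated_diff
    rel_diff = (
        (modulated_input - prev_modulated_input).abs().mean()
        / prev_modulated_input.abs().mean()
    ).item()
    # Rescale the modulated input difference to an estimated output difference.
    accumulated_diff += rescale_poly(rel_diff)
    if accumulated_diff < threshold:
        return True
    accumulated_diff = 0.0  # run the full transformer and reset the accumulator
    return False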


CUDA Graphs
NVIDIA TensorRT-LLM visual_gen provides a ready-made wrapper to support CUDA Graphs capture. Simply import the wrapper and replace the forward function:
from visual_gen.utils.cudagraph import cudagraph_wrapper
model.forward = cudagraph_wrapper(model.forward)
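Under the hood, the wrapper follows the standard PyTorch capture-and-replay pattern. The simplified sketch below illustrates that pattern for a fixed-shape forward pass; it is not the actual visual_gen implementation.
import torch

def capture_forward(forward_fn, *static_inputs):
    """Simplified capture-and-replay pattern for a fixed-shape forward pass."""
    # Warm up on a side stream so lazy initialization is not captured.
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        for _ in range(3):
            static_output = forward_fn(*static_inputs)
    torch.cuda.current_stream().wait_stream(stream)

    # Capture the forward pass into a CUDA graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = forward_fn(*static_inputs)

    def replay(*inputs):
        # Copy new data into the captured (static) input tensors, then replay.
        for static_in, new_in in zip(static_inputs, inputs):
            static_in.copy_(new_in)
        graph.replay()
        return static_output

    return replay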
Torch compile
In all of the team's experiments, torch.compile was enabled except for the baseline run, as it is not enabled in FLUX.2 [dev] by default.
model = torch.compile(model)
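Depending on the model and GPU, more aggressive compile settings may be worth trying; they increase warmup time, and whether they help FLUX.2 [dev] should be verified empirically.
# Optional: more aggressive autotuning at the cost of a longer warmup.
model = torch.compile(model, mode="max-autotune", fullgraph=False)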
Multi-GPU support
Enabling multiple GPUs using TensorRT-LLM visual_gen involves four steps:
- Modify the model.forward function to insert code handling inter-GPU communication
- Replace the attention implementation in your model with ditAttnProcessor
- Select a parallel algorithm and set the parallelism size in the config
- Launch with torchrun
The following snippet provides an example. Insert the split code at the beginning of model.forward to distribute input data across multiple GPUs:
from visual_gen.utils import (
    dit_sp_gather,
    dit_sp_split,
)

# ...
hidden_states = dit_sp_split(hidden_states, dim=1)
encoder_hidden_states = dit_sp_split(encoder_hidden_states, dim=1)
img_ids = dit_sp_split(img_ids, dim=1)
txt_ids = dit_sp_split(txt_ids, dim=1)
Next, insert the gather code at the end of model.forward, before the return statement:
output = dit_sp_gather(output, dim=1)
Then replace the original attention implementation with the provided attention processor, which ensures proper communication across multiple GPUs:
from visual_gen.layers import ditAttnProcessor

# ...
def attention(...):
    # ...
    x = ditAttnProcessor().visual_gen_attn(q, k, v, tensor_layout="HND")
    # ...
Set the correct parallel size in the configuration. For example, to use Ulysses parallelism on four GPUs:
dit_configs = {
    ...
    "parallel": {
        "dit_ulysses_size": 4,
    },
    ...
}
Finally, call the setup_configs API to activate the configs:
visual_gen.setup_configs(**dit_configs)
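Putting the pieces together, a combined configuration that enables both TeaCache and four-way Ulysses parallelism can be activated in a single call; only keys already shown in this post are used here.
import visual_gen

dit_configs = {
    "teacache": {
        "enable_teacache": True,
        "use_ret_steps": True,
        "teacache_thresh": 0.05,
        "ret_steps": 10,
        "cutoff_steps": 50,
    },
    "parallel": {
        "dit_ulysses_size": 4,
    },
}
visual_gen.setup_configs(**dit_configs)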
When using multiple GPUs, the script must be launched with torchrun. TensorRT-LLM visual_gen will use the rank information from torchrun and handle all communication and job splitting correctly.
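For reference, torchrun exposes the rank information through standard environment variables. A minimal entry-script skeleton might start like this; the script name in the comment is hypothetical.
# Launch with, for example:
#   torchrun --nproc_per_node=4 run_flux2_visual_gen.py
import os

import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
print(f"Initialized rank {local_rank} of {world_size} GPUs")
# TensorRT-LLM visual_gen reads the same rank information to split the work
# across GPUs and set up the required communication.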
Performance evaluation
All of the inference optimizations described in this post (low-precision kernels, the caching technique, and multi-GPU inference) have been included in an end-to-end FLUX.2 [dev] example.
As shown in Figure 4, the NVIDIA DGX B200 architecture delivers a 1.7x generational leap over NVIDIA H200, even when using default BF16 precision. Further, the layered application of inference optimizations—including CUDA Graphs, torch.compile, NVFP4 precision, and TeaCache—incrementally boosts single-B200 performance from that baseline to a considerable 6.3x speedup.
Ultimately, multi-GPU inference on a two-B200 configuration achieves a 10.2x performance increase compared to the H200, the current industry standard.


The baseline is the original FLUX.2 [dev] without any optimization and without torch.compile enabled. The optimized series includes enabling torch.compile, CUDA Graphs, NVFP4, and TeaCache. The number of diffusion steps used in the benchmarks was 50.
On a single GPU, the team found that NVFP4 and TeaCache provide a good tradeoff between speedup and output quality, delivering roughly 2x speedups each. torch.compile is a near-lossless acceleration technique that most developers are familiar with, but its benefits here are limited. CUDA Graphs are most helpful for multi-GPU inference, unlocking incremental scaling across multiple GPUs on NVIDIA B200. Finally, the overall pipeline proves robust with FP8 quantization of the text encoder, providing additional benefits for large-scale deployments.
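As a conceptual illustration of FP8 weight quantization (not the exact recipe used for the Mistral Small 3 text encoder), per-tensor E4M3 casting can be emulated in plain PyTorch:
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def quantize_weight_fp8(weight: torch.Tensor):
    """Per-tensor FP8 weight quantization (conceptual emulation)."""
    scale = weight.abs().max().clamp_min(1e-12) / FP8_E4M3_MAX
    return (weight / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(weight_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Restore an approximation of the original weight for reference checks.
    return weight_fp8.to(torch.float32) * scale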
On multiple GPUs, the TensorRT-LLM visual_gen sequence parallelism delivers near-linear scaling as more GPUs are added. The same effect is observed on NVIDIA Blackwell B200 and GB200 GPUs as well as NVIDIA Blackwell Ultra B300 and GB300 GPUs. Additional optimizations are in progress for NVIDIA Blackwell Ultra GPUs.


Get started with FLUX.2 on NVIDIA Blackwell GPUs
FLUX.2 is a significant advancement in image generation, successfully combining high-quality outputs with user-friendly deployment options. The NVIDIA team, in collaboration with BFL, achieved substantial acceleration for FLUX.2 [dev] on the most powerful NVIDIA data center GPUs.
Applying these new techniques to the FLUX.2 [dev] model, including NVFP4 quantization and TeaCache, delivers a strong generational leap in inference speed. The combined effect of these optimizations is a remarkable reduction in latency, enabling efficient deployment on NVIDIA data center GPUs.
To start building your own inference pipeline with these state-of-the-art optimizations, check out the end-to-end FLUX.2 example and accompanying code in the NVIDIA/TensorRT-LLM/visual_gen GitHub repo.
