The latest AI models continue to grow in size and complexity, demanding ever-increasing amounts of compute performance for training and inference, far beyond what Moore’s Law alone can sustain. That’s why NVIDIA engages in extreme codesign: designing cohesively across multiple chips and a mountain of software enables large generational leaps in AI factory performance and efficiency.
Lower-precision AI formats are key to improving compute performance and energy efficiency. Bringing the advantages of ultra-low-precision numerics to AI training and inference while maintaining high accuracy requires extensive engineering across every layer of the technology stack. It spans the creation of the formats, implementation in silicon, enablement across many libraries, and working closely with the ecosystem to deploy new training recipes and inference optimization techniques. NVFP4, developed and implemented for NVIDIA GPUs starting with NVIDIA Blackwell, delivers the performance and energy-efficiency advantages of 4-bit floating-point precision while maintaining accuracy on par with higher-precision formats.
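To make the block-scaled 4-bit idea concrete, below is a minimal NumPy sketch of NVFP4-style fake quantization: FP4 (E2M1) values stored in blocks of 16 elements, with each block sharing a scale factor. This is an illustration under simplifying assumptions, not the production implementation; the real format also encodes each block scale in FP8 (E4M3) and applies a second per-tensor FP32 scale, and the scaling is handled by Blackwell Tensor Cores in hardware.

```python
# Illustrative NumPy sketch of NVFP4-style block-scaled 4-bit quantization.
# The production format stores FP4 (E2M1) values in blocks of 16 with an
# FP8 (E4M3) block scale plus a per-tensor FP32 scale; this sketch simplifies
# the scales to one float per block.
import numpy as np

# Magnitudes representable by FP4 E2M1: {0, 0.5, 1, 1.5, 2, 3, 4, 6}, plus sign.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-E2M1_GRID[::-1], E2M1_GRID])
BLOCK = 16  # number of elements that share one scale factor

def quantize_dequantize_nvfp4(x: np.ndarray) -> np.ndarray:
    """Fake-quantize a 1-D tensor: scale each block so its max maps to 6.0,
    snap every element to the E2M1 grid, then rescale to the original range."""
    assert x.ndim == 1 and x.size % BLOCK == 0
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero for all-zero blocks
    scaled = blocks / scales
    # Round each scaled element to the nearest representable E2M1 value.
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    return (E2M1_GRID[idx] * scales).reshape(x.shape)

x = np.random.randn(64).astype(np.float32)
xq = quantize_dequantize_nvfp4(x)
print("max abs quantization error:", np.abs(x - xq).max())
```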
For those seeking to maximize AI training and inference performance, here are three things to know about NVFP4.
1. NVFP4 enables large performance leaps for training and inference on the Blackwell architecture—and beyond
NVIDIA Blackwell Ultra GPUs deliver peak dense NVFP4 throughput of up to 15 petaFLOPS, 3x that of FP8 on the same GPUs. The gains aren’t just about peak specs; they’re visible in measured performance on training and inference workloads.
For inference, as shown in a recent technical blog post, moving from FP8 to NVFP4 delivers dramatic improvements in token throughput at a given level of interactivity on DeepSeek-R1, a popular 671B-parameter mixture-of-experts (MoE) model. Throughput increases at a given per-user token rate, and even higher token rates become achievable, enabling better user experiences.


NVIDIA also recently published an NVFP4 training recipe, bringing the significant performance advantages of NVFP4 to model training and enabling model makers to train AI faster and at lower cost.


In the latest round of the MLPerf Training benchmark suite, multiple NVIDIA GB300 NVL72 systems, totaling 512 Blackwell Ultra GPUs, worked together using NVFP4 precision to complete the Llama 3.1 405B pre-training benchmark in 64.6 minutes. That is 1.9x faster than 512 Blackwell GPUs across multiple NVIDIA GB200 NVL72 systems, which completed the benchmark using FP8 in the prior round.
Looking ahead, the NVIDIA Rubin platform delivers large leaps in NVFP4 capability for training and inference, offering 35 petaFLOPS of NVFP4 training compute and 50 petaFLOPS of NVFP4 Transformer Engine inference compute. That represents a 3.5x and 5x leap compared to Blackwell, respectively.
2. NVFP4 delivers great accuracy, proven on industry benchmarks
For MLPerf Training and Inference submissions in the closed division to be valid, they must meet accuracy requirements specified by the benchmarks. For inference, responses must meet certain accuracy thresholds, and for training, models must be trained to specified quality targets (i.e., the model training process must converge).
NVIDIA successfully submitted results in the closed division on every large language model (LLM) test using NVFP4 on Blackwell and Blackwell Ultra GPUs in the latest version of MLPerf Training. NVIDIA has also submitted results across many models and scenarios using NVFP4 in MLPerf Inference, including DeepSeek-R1, Llama 3.1 8B and 405B, and Llama 2 70B. NVIDIA used NVFP4-quantized versions of the models, all while meeting the strict benchmark accuracy requirements.


3. NVFP4 enjoys broad and growing ecosystem support
Libraries like NVIDIA Model Optimizer, LLM Compressor, and torch.ao enable developers to quantize models trained at higher precision to NVFP4 and implement an NVFP4 KV cache to support long context and large batch sizes while preserving accuracy. Popular inference frameworks, including NVIDIA TensorRT-LLM, vLLM, and SGLang, also support running models in NVFP4 format today, with models available in NVFP4 variants. For instance, on Hugging Face, developers can find ready-to-deploy NVFP4 versions of models such as Llama 3.3 70B, FLUX.2, DeepSeek-R1-0528, Kimi-K2-Thinking, Qwen3-235B-A22B, and NVIDIA Nemotron Nano.
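As a sketch of what that quantization workflow can look like with NVIDIA Model Optimizer’s PyTorch API, the snippet below post-training-quantizes a Hugging Face checkpoint to NVFP4. The model name and calibration prompts are placeholders, and the exact NVFP4 config constant may vary by Model Optimizer version, so treat this as an outline rather than a drop-in script.

```python
# Sketch: post-training quantization of an LLM to NVFP4 with NVIDIA Model Optimizer.
# The checkpoint name, calibration prompts, and config constant are illustrative;
# check the Model Optimizer docs for the NVFP4 config available in your version.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A tiny placeholder calibration set; real recipes use hundreds of samples.
calib_prompts = ["The quick brown fox", "NVFP4 is a 4-bit floating-point format"]

def forward_loop(m):
    # Run representative samples through the model so the quantizer can
    # calibrate its scale factors.
    with torch.no_grad():
        for prompt in calib_prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
            m(**inputs)

# Quantize weights and activations to the NVFP4 block format.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```

The quantized model can then be exported for deployment with TensorRT-LLM, vLLM, or SGLang, or developers can skip this step entirely by pulling a pre-quantized NVFP4 checkpoint from Hugging Face.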
The ecosystem is also adopting NVFP4 to increase inference throughput in production across a variety of models. These companies include Black Forest Labs, Radical Numerics, and Cognition.
Black Forest Labs worked with NVIDIA to scale NVFP4 inference for FLUX.2 on Blackwell. “By layering optimizations like CUDA Graphs, torch.compile, NVFP4 precision, and TeaCache, we achieve up to a 6.3x speedup on a single B200—dramatically reducing latency and enabling more efficient production deployment,” said Robin Rombach, co-founder and CEO of Black Forest Labs.
Radical Numerics has leveraged NVFP4 to speed up scientific world model scaling. “Unlike language, scientific data pushes us beyond the classical single-modality autoregressive recipe, demanding extremely long-context methods and robust multimodal fusion,” said Michael Poli, co-founder and chief AI scientist at Radical Numerics. He added that the company is “highly optimistic” about using low-precision recipes to pretrain and post-train its new architecture.
And Cognition is seeing “significant latency and throughput gains” by using NVFP4 in large-scale reinforcement learning, said Steven Cao, a member of Cognition’s research team.
The NVIDIA Transformer Engine library includes an implementation of the NVFP4 training recipe, and training frameworks such as Megatron-Bridge provide implementations for developers to get started. NVIDIA also continues to innovate and collaborate with partners to bring the performance and efficiency advantages of NVFP4 training to the entire ecosystem, paving the way to smarter, more complex models trained faster and more efficiently.
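For reference, here is a minimal sketch of the Transformer Engine autocast-plus-recipe pattern that low-precision training builds on. The `fp8_autocast`, `DelayedScaling`, and `te.Linear` calls are the library’s documented FP8 API; NVFP4 training plugs into the same mechanism via an NVFP4 recipe, whose exact class name varies by Transformer Engine version, so check the library documentation before swapping it in.

```python
# Sketch: low-precision training with Transformer Engine's autocast + recipe pattern.
# The FP8 recipe below is the documented API; substitute the NVFP4 recipe exposed
# by your Transformer Engine version to use the NVFP4 training path.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

layer = te.Linear(4096, 4096, bias=True).cuda()
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

# Documented FP8 delayed-scaling recipe, shown here as a stand-in for the
# NVFP4 recipe class provided by newer Transformer Engine releases.
low_precision_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
with te.fp8_autocast(enabled=True, fp8_recipe=low_precision_recipe):
    y = layer(x)  # GEMM runs in the low-precision format selected by the recipe

loss = y.float().pow(2).mean()  # placeholder loss for the sketch
loss.backward()
optimizer.step()
```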
Learn more
Using NVFP4 can deliver large performance gains on both the NVIDIA Blackwell and NVIDIA Rubin platforms. Through extreme codesign, these large performance gains can also be achieved with excellent accuracy for both model training and inference. NVFP4 versions of popular open LLMs are widely available, enabling services to run these models with higher throughput and at a lower cost per million tokens.
Learn more about how the many architectural leaps enabled by the Rubin platform, including enhanced NVFP4, unlock new levels of AI training and inference performance.
