In previous posts on FP8 training, we explored the fundamentals of FP8 precision and took a deep dive into the various scaling recipes for practical large-scale deep learning. If you haven't read those yet, we recommend starting there for a solid foundation.
This post focuses on what matters most in production: speed. FP8 training promises faster computation, but how much real-world acceleration does it actually deliver? And what hidden overheads might diminish these theoretical gains?
We'll compare the leading FP8 scaling recipes side by side, using real benchmarks on NVIDIA H100 and NVIDIA DGX B200 GPUs. We rigorously evaluate each FP8 recipe, from delayed and current scaling to MXFP8 and generic block scaling, using the NVIDIA NeMo Framework, in terms of training efficiency, numerical stability, hardware compatibility, and scalability as model sizes increase.
By examining both convergence behavior and throughput across diverse LLMs, this post provides clear, actionable insights into how each approach performs in practical, demanding scenarios.
Why does speedup matter in FP8 training?
Training LLMs and other state-of-the-art neural networks is an increasingly resource-intensive process, demanding vast computational power, memory, and time. As model and dataset scales continue to grow, the associated costs (financial, environmental, and temporal) have become a central concern for researchers and practitioners.
FP8 precision directly addresses these challenges by fundamentally improving computational efficiency. By reducing numerical precision from 16 or 32 bits down to just 8 bits, FP8 enables significantly faster computation, which translates directly into accelerated research cycles, reduced infrastructure expenditures, and the ability to train larger, more ambitious models on existing hardware.
Beyond raw computational speed, FP8 also reduces communication overhead in distributed training environments: lower-precision activations and gradients mean less data must be transferred between GPUs, directly alleviating communication bottlenecks and helping maintain high throughput at scale. This advantage becomes increasingly important as model and cluster sizes expand.
What are the strengths and trade-offs of FP8 scaling recipes?
This section briefly recaps the four primary FP8 scaling approaches evaluated in this work, highlighting their unique characteristics. For a deeper dive into the mechanics and implementation details of each recipe, see Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training.
- Per-tensor delayed scaling: Offers good FP8 computation performance by using a stable, history-derived scaling factor, but its robustness can be impacted by outlier values in the amax history, potentially leading to instabilities that hinder overall training.
- Per-tensor current scaling: Provides high responsiveness and fast adaptation to tensor ranges, leading to improved model convergence while maintaining minimal computational and memory overhead thanks to its real-time amax calculation and lack of historical tracking.
- Sub-channel (generic block) scaling: Enhances precision and can unlock full FP8 efficiency by allowing configurable block dimensions and finer-grained scaling, though smaller blocks increase scaling-factor storage overhead and transpose operations may require re-computation.
- MXFP8: As a hardware-native solution, this recipe delivers highly efficient block scaling with fixed 32-value blocks for both activations and weights and E8M0 power-of-2 scales, leading to significant performance gains (up to 2x GEMM throughput) and minimized quantization error through NVIDIA Blackwell accelerated operations (see the recipe-selection sketch after this list).
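If you want to experiment with these recipes outside of NeMo, Transformer Engine exposes them as recipe objects passed to fp8_autocast. The sketch below is a minimal illustration, assuming a recent TE release: DelayedScaling is long-established, while the current-scaling and MXFP8 recipe class names shown in comments are newer additions and may differ in your installed version.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Per-tensor delayed scaling: scale factors derived from a rolling amax history.
delayed = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,   # E4M3 for forward tensors, E5M2 for gradients
    amax_history_len=1024,
    amax_compute_algo="max",
)

# Newer recipe classes (available in recent TE releases; exact names may vary by version):
# current = recipe.Float8CurrentScaling(fp8_format=recipe.Format.HYBRID)
# mxfp8   = recipe.MXFP8BlockScaling(fp8_format=recipe.Format.E4M3)

# Any TE module run inside fp8_autocast uses the selected recipe for its GEMMs.
layer = te.Linear(4096, 4096, bias=False).cuda()
x = torch.randn(32, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=delayed):
    y = layer(x)
```

In NeMo, the same choices are surfaced through the training recipe configuration rather than called directly, as shown later in the experimental setup section.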
| Scaling recipe | Speedup | Numerical stability | Granularity | Recommended models | Recommended hardware |
| --- | --- | --- | --- | --- | --- |
| Delayed scaling | High | Moderate | Per tensor | Small dense models | NVIDIA Hopper |
| Current scaling | High | Good | Per tensor | Medium-sized dense and hybrid models | NVIDIA Hopper |
| Sub-channel scaling | Medium | High | Custom 2D block of 128×128 | MoE models | NVIDIA Hopper and NVIDIA Blackwell |
| MXFP8 | Medium | High | Per 32-value block | All | NVIDIA Blackwell and NVIDIA Grace Blackwell |
Scaling recipe granularity
Figure 1 shows measured FP8 general matrix multiplication (GEMM) throughput speedup over BF16 for various scaling approaches on NVIDIA H100. Hardware-native scaling (channel-wise, subchannel-wise, tensor-wise) achieves up to 2x acceleration, underscoring why FP8 is so effective at the hardware level.
While FP8 offers significant speedups over BF16, the choice of scaling granularity (that is, how finely scaling factors are applied within a tensor) introduces nuanced trade-offs in actual performance, particularly for GEMM operations. Finer granularity, while useful for numerical stability and accuracy because it better accommodates intra-tensor variability, can introduce additional overhead that impacts raw throughput.
Figure 1. Measured FP8 GEMM throughput speedup over BF16 for different scaling granularities on NVIDIA H100, as a function of GEMM size (K)
A clear hierarchy in performance is observed when comparing scaling granularities for GEMM operations. Tensor-wise scaling generally demonstrates the highest speedup: with only a single scaling factor per tensor involved in the GEMM, the overhead associated with scale management is minimized.
Channel-wise scaling represents an intermediate level of granularity, typically applying a scaling factor per channel or per row/column. As seen in the figure, its speedup falls between tensor-wise and 2D block-wise methods.
Sub-channel-wise (2Dx2D) scaling (for example, 1×128 blocks for activations and 128×128 blocks for weights) represents a finer granularity and generally exhibits slightly lower speedups compared to tensor-wise scaling. Managing multiple scaling factors for the many smaller blocks within a tensor introduces a computational cost that, while valuable for accuracy, can reduce peak raw throughput. This holds true for other configurable block dimensions such as 1Dx1D or 1Dx2D, where finer block divisions mean more scales to process per GEMM.
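To make the granularity trade-off concrete, the following sketch (our own illustration, not code from Transformer Engine) computes FP8 scaling factors at per-tensor and 128×128 block granularity. The block variant follows local dynamic range more faithfully, but it produces hundreds of scales that must be carried through each GEMM instead of just one.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

def per_tensor_scale(x: torch.Tensor) -> torch.Tensor:
    # One scale for the whole tensor: minimal bookkeeping per GEMM.
    return FP8_E4M3_MAX / x.abs().amax().clamp(min=1e-12)

def per_block_scales(x: torch.Tensor, block: int = 128) -> torch.Tensor:
    # One scale per (block x block) tile: tracks local dynamic range better,
    # but yields (rows/block) * (cols/block) scales to manage per GEMM.
    rows, cols = x.shape
    tiles = x.reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3)).clamp(min=1e-12)
    return FP8_E4M3_MAX / amax

w = torch.randn(4096, 4096)
print(per_tensor_scale(w).shape)       # torch.Size([])      -> 1 scale
print(per_block_scales(w, 128).shape)  # torch.Size([32, 32]) -> 1,024 scales
```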
Crucially, the x-axis in Figure 1 highlights the impact of GEMM size. As K increases (meaning larger GEMM operations), the overall speedup of FP8 over BF16 generally improves across all scaling methods. This is because for larger GEMMs, the computational savings from using 8-bit precision become more dominant, outweighing the relative overhead of managing scaling factors. In essence, larger GEMMs allow the inherent advantages of FP8 compute to shine through more effectively, even with the added complexity of finer-grained scaling.
While hardware-native solutions like MXFP8 are designed to mitigate the overhead of block scaling through dedicated Tensor Core acceleration, for general FP8 block scaling implementations, the trade-off between granularity (for accuracy) and raw performance remains a key consideration.
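If you want to reproduce the shape of this trend on your own hardware, a rough micro-benchmark along these lines can help. This is a sketch, assuming Transformer Engine is installed and an H100-class GPU is available; it times only the forward GEMM of a single layer, so absolute numbers will differ from the end-to-end training speedups reported later in this post.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

def time_ms(fn, iters=50):
    # Average wall-clock time per call using CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(5):  # warmup
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=1024, amax_compute_algo="max")

for k in (2048, 4096, 8192, 16384):
    layer = te.Linear(k, k, bias=False).cuda().to(torch.bfloat16)
    x = torch.randn(8192, k, device="cuda", dtype=torch.bfloat16)

    bf16_ms = time_ms(lambda: layer(x))

    def fp8_step():
        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
            layer(x)

    fp8_ms = time_ms(fp8_step)
    print(f"K={k}: BF16 {bf16_ms:.2f} ms, FP8 {fp8_ms:.2f} ms, speedup {bf16_ms / fp8_ms:.2f}x")
```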
Beyond raw speedup, a critical aspect of low-precision training is convergence—how well the model learns and reduces its loss, and ultimately, how it performs on specific downstream tasks. While training loss provides useful insight into the learning process, it's important to remember that it's not the sole metric for FP8 efficacy; robust downstream evaluation metrics are the ultimate arbiters of a model's quality.
Figure 2. Training loss trajectories for FP8 block-wise and per-tensor scaling compared with the BF16 baseline
When adopting FP8, the expectation is that the training loss trajectory should closely mirror that of a higher-precision baseline, such as BF16, to ensure that the model is learning effectively without significant degradation. Figure 2 shows the training loss trajectories for different scaling strategies relative to BF16. The pink line represents the BF16 baseline. Notably, the dark purple line, representing FP8 block-wise scaling, consistently follows a trajectory very similar to BF16. This close alignment indicates that, with finer granularity, block-wise scaling can preserve numerical fidelity more effectively, leading to convergence behavior that closely matches higher-precision BF16 training.
Conversely, the light green line, representing FP8 per-tensor scaling, occasionally shows slight deviations or larger fluctuations in loss. This subtle difference in convergence trajectory highlights the trade-off inherent in granularity: while coarser-grained per-tensor scaling might offer higher raw GEMM throughput, as discussed previously, finer-grained block-wise scaling tends to yield less accuracy loss and a more stable learning path that closely mirrors BF16.
This illustrates the crucial balance between speedup and numerical stability in FP8 training. More granular scaling methods, by better accommodating the varying dynamic ranges within tensors, can produce convergence trajectories that more faithfully track higher-precision baselines, though this may come with a corresponding difference in speed compared to less granular approaches. The optimal choice often involves weighing the demands of downstream evaluation against available computational resources and desired training speed.
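A simple way to quantify how closely an FP8 run tracks its BF16 baseline is to compare the two logged loss curves directly. The snippet below is a minimal sketch; the file names are hypothetical placeholders for wherever your training jobs log per-step loss.

```python
# Sanity-check FP8 convergence against a BF16 baseline: given two loss curves
# logged at the same steps, report the mean and worst-case relative deviation.
import numpy as np

bf16_loss = np.loadtxt("bf16_loss.txt")  # hypothetical file: one loss value per step
fp8_loss = np.loadtxt("fp8_loss.txt")    # hypothetical file: same logging cadence

steps = min(len(bf16_loss), len(fp8_loss))
rel_dev = np.abs(fp8_loss[:steps] - bf16_loss[:steps]) / np.abs(bf16_loss[:steps])

print(f"mean relative deviation:  {rel_dev.mean():.4%}")
print(f"worst relative deviation: {rel_dev.max():.4%}")
# A trajectory that stays close to the baseline (small deviations that do not grow
# over training) is the behavior block-wise scaling exhibits in Figure 2.
```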
Experimental setup
All experiments in this post were conducted using NVIDIA NeMo Framework 25.04, the latest release of the NeMo Framework at the time of writing. NeMo Framework 25.04 provides robust, production-grade support for FP8 training through the NVIDIA Transformer Engine (TE) and includes out-of-the-box recipes for dense architectures.
We evaluated two leading FP8 approaches: the current scaling recipe on H100 GPUs and the MXFP8 recipe on the newer NVIDIA DGX B200 architecture. For both, we tested a range of state-of-the-art models, including Llama 3 8B, Llama 3 70B, Llama 3.1 405B, Nemotron 15B, and Nemotron 340B. Each setup was compared directly against a BF16 baseline to measure the practical speedup delivered by FP8 in real-world training scenarios.
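For reference, the sketch below shows roughly how FP8 can be switched on for a stock NeMo 2.0 pretraining recipe. It assumes the NeMo 2.0 recipe API and NeMo-Run; the MegatronMixedPrecision argument names follow the conventions we are aware of but may differ between releases, so verify them against your installed NeMo Framework version rather than treating this as a definitive configuration.

```python
from nemo.collections import llm
from nemo import lightning as nl
import nemo_run as run

# Start from a stock dense-model recipe (Llama 3 8B here, as used in our benchmarks).
recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_fp8",
    num_nodes=1,
    num_gpus_per_node=8,
)

# Swap the mixed-precision plugin for one with FP8 enabled. fp8="hybrid" selects
# E4M3 forward / E5M2 gradient tensors; the scaling recipe itself (delayed, current,
# MXFP8) is controlled by additional fields in recent releases -- check the
# MegatronMixedPrecision signature shipped with your NeMo version.
recipe.trainer.plugins = run.Config(
    nl.MegatronMixedPrecision,
    precision="bf16-mixed",
    fp8="hybrid",
    fp8_amax_history_len=1024,
    fp8_amax_compute_algo="max",
)

if __name__ == "__main__":
    run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8))
```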
Current scaling recipe
As illustrated in Figure 3, the current scaling FP8 recipe on H100 GPUs demonstrates a pronounced, model-size-dependent speedup compared to the BF16 baseline. For smaller models such as Llama 3 8B, the speedup is roughly 1.30x.
This advantage becomes even more significant with larger architectures. For instance, the Llama 3 70B model achieves a speedup of 1.43x, and the largest model in our benchmark suite, Llama 3.1 405B, reaches an impressive 1.53x acceleration.
Figure 3. Speedup of the FP8 current scaling recipe over BF16 on H100 GPUs for Llama 3 8B, Llama 3 70B, and Llama 3.1 405B
This upward trend is not just a statistical curiosity; it underscores a fundamental advantage of FP8 training for large-scale language models. As model size and computational complexity increase, the efficiency gains from reduced-precision arithmetic become more pronounced.
The reason is twofold. First, larger models naturally involve more matrix multiplications and data movement, both of which benefit substantially from the reduced memory footprint and higher throughput of FP8 on modern hardware. Second, the overheads associated with scaling and dynamic-range adjustments become relatively less significant as total computation grows, allowing the raw performance advantages of FP8 to dominate.
MXFP8 recipe
Figure 4 shows the performance of the MXFP8 recipe on DGX B200 GPUs, revealing a consistent speedup over BF16 across different model sizes, with observed gains ranging from 1.28x to 1.37x. While these absolute speedup values are slightly lower than those achieved by the current scaling recipe, they are notable for their stability and reliability across a diverse set of models.
Figure 4. Speedup of the MXFP8 recipe over BF16 on DGX B200 GPUs across model sizes
The relative flatness in speedup from 8B to 70B parameters, contrasted with the larger jump at 340B, reflects how block-based scaling interacts with model and hardware characteristics. MXFP8 assigns a shared scaling factor to each 32-element block, which can introduce additional memory-access overhead for mid-sized models. However, as model size increases and computation becomes the dominant bottleneck (as seen with Nemotron 340B), the efficiency advantages of block-wise FP8 become more pronounced, leading to the observed peak speedup.
These results highlight the architectural strengths of the NVIDIA Blackwell (B200) platform, which was purpose-built to maximize efficiency for lower-precision formats like FP8 and, specifically, for block-based scaling approaches such as MXFP8. The B200 Tensor Cores and advanced memory hierarchy are optimized for these microscaling formats, enabling high throughput, efficient memory utilization, and stable convergence even as models scale into the hundreds of billions of parameters. With MXFP8, each block of 32 values shares a scaling factor, striking a balance between dynamic range and computational efficiency. This approach delivers reliable acceleration while minimizing the risk of numerical instability, a key consideration when pushing models to ever-larger scales.
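To make the 32-value block idea concrete, here is a small numeric sketch (our own illustration; actual MXFP8 quantization is performed by Transformer Engine and Blackwell Tensor Cores, not user code) that picks a power-of-two scale per 32-element block and maps the block into E4M3 range.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def mxfp8_block_scales(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Illustrative only: one power-of-two (E8M0-style) scale per 32-element block."""
    blocks = x.reshape(-1, block)
    amax = blocks.abs().amax(dim=1).clamp(min=2.0 ** -127)
    # One simple power-of-two choice that keeps x / scale within E4M3 range;
    # the OCP MX spec defines the exact exponent-selection rule used in hardware.
    exp = torch.ceil(torch.log2(amax / FP8_E4M3_MAX))
    return torch.exp2(exp)

x = torch.randn(4, 1024)
scales = mxfp8_block_scales(x)          # shape (128,): one scale per 32-value block
q = (x.reshape(-1, 32) / scales[:, None]).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
print(scales.shape, q.abs().max())      # scaled values now fit the E4M3 range
```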
How does NVIDIA GB200 Grace Blackwell Superchip compare to NVIDIA Blackwell architecture?
The comparison between GB200 and B200 highlights how architectural integration and system design can translate into tangible performance gains for large-scale AI workloads. Both are built on the NVIDIA Blackwell architecture, but the GB200 superchip combines two B200 GPUs with a Grace CPU, interconnected through NVIDIA NVLink, resulting in a unified memory domain and exceptionally high memory bandwidth.
Figure 5. FP8 training performance comparison of the NVIDIA GB200 Grace Blackwell Superchip and NVIDIA B200
Get started with practical FP8 training
A clear pattern emerges from these benchmarks: for dense models, the larger the model, the greater the speedup with FP8. This is because as model size increases, the number of matrix multiplications (GEMMs) grows rapidly, and these operations benefit most from the reduced precision and higher throughput of FP8. In large dense models, FP8 enables dramatic efficiency gains, making it possible to train and fine-tune ever-larger language models with less time and compute.
These empirical results reinforce the particular strengths and trade-offs of each FP8 scaling recipe detailed in this post and demonstrate that both per-tensor and MXFP8 approaches deliver significant speedup and convergence advantages over BF16.
Ready to try these techniques yourself? Explore the FP8 recipes to get started with practical FP8 training configurations and code.
