In previous posts on FP8 training, we explored the fundamentals of FP8 precision and took a deep dive into the various scaling recipes for practical large-scale deep learning. If you haven't read those yet, we recommend starting there for a solid foundation.
This post focuses on what matters most in production: speed. FP8 training promises faster computation, but how much real-world acceleration does it actually deliver? And what hidden overheads might diminish these theoretical gains?
We'll compare the leading FP8 scaling recipes side by side, using real benchmarks on NVIDIA H100 and NVIDIA DGX B200 GPUs. We rigorously evaluate each FP8 recipe, from delayed and current scaling to MXFP8 and generic block scaling, using the NVIDIA NeMo Framework, in terms of training efficiency, numerical stability, hardware compatibility, and scalability as model sizes increase.
By examining both convergence behavior and throughput across diverse LLMs, this post provides clear, actionable insights into how each approach performs in practical, demanding scenarios.
Why does speedup matter in FP8 training?
Training LLMs and other state-of-the-art neural networks is an increasingly resource-intensive process, demanding vast computational power, memory, and time. As model and dataset scales continue to grow, the associated costs (financial, environmental, and temporal) have become a central concern for researchers and practitioners.
FP8 precision directly addresses these challenges by fundamentally improving computational efficiency. By reducing numerical precision from 16 or 32 bits down to just 8 bits, FP8 enables significantly faster computation, which translates directly into accelerated research cycles, reduced infrastructure expenditures, and the ability to train larger, more ambitious models on existing hardware.
Beyond raw computational speed, FP8 also reduces communication overhead in distributed training environments: lower-precision activations and gradients mean less data must be transferred between GPUs, directly alleviating communication bottlenecks and helping maintain high throughput at scale. This advantage becomes increasingly important as model and cluster sizes expand.
What are the strengths and trade-offs of FP8 scaling recipes?
This section briefly recaps the four primary FP8 scaling approaches evaluated in this work, highlighting their unique characteristics. For a deeper dive into the mechanics and implementation details of each recipe, see Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training.
- Per-tensor delayed scaling: Offers good FP8 computation performance by using a stable, history-derived scaling factor, but its robustness can be impacted by outlier values in the amax history, potentially leading to instabilities that hinder overall training.
- Per-tensor current scaling: Provides high responsiveness and fast adaptation to tensor ranges, leading to improved model convergence while maintaining minimal computational and memory overhead thanks to its real-time amax calculation and lack of historical tracking.
- Sub-channel (generic block) scaling: Enhances precision and can unlock full FP8 efficiency by allowing configurable block dimensions and finer-grained scaling, though smaller blocks increase scaling-factor storage overhead and transpose operations may require re-computation.
- MXFP8: As a hardware-native solution, this recipe delivers highly efficient block scaling with fixed 32-value blocks for both activations and weights and E8M0 power-of-2 scales, leading to significant performance gains (up to 2x GEMM throughput) and minimized quantization error through NVIDIA Blackwell accelerated operations (see the recipe-selection sketch after this list).
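If you want to experiment with these recipes outside of NeMo, Transformer Engine exposes them as recipe objects passed to fp8_autocast. The sketch below is a minimal illustration, assuming a recent TE release: DelayedScaling is long-established, while the current-scaling and MXFP8 recipe class names shown in comments are newer additions and may differ in your installed version.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Per-tensor delayed scaling: scale factors derived from a rolling amax history.
delayed = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,   # E4M3 for forward tensors, E5M2 for gradients
    amax_history_len=1024,
    amax_compute_algo="max",
)

# Newer recipe classes (available in recent TE releases; exact names may vary by version):
# current = recipe.Float8CurrentScaling(fp8_format=recipe.Format.HYBRID)
# mxfp8   = recipe.MXFP8BlockScaling(fp8_format=recipe.Format.E4M3)

# Any TE module run inside fp8_autocast uses the selected recipe for its GEMMs.
layer = te.Linear(4096, 4096, bias=False).cuda()
x = torch.randn(32, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=delayed):
    y = layer(x)
```

In NeMo, the same choices are surfaced through the training recipe configuration rather than called directly, as shown later in the experimental setup section.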
| Scaling recipe | Speedup | Numerical stability | Granularity | Recommended models | Recommended hardware |
| --- | --- | --- | --- | --- | --- |
| Delayed scaling | High | Moderate | Per tensor | Small dense models | NVIDIA Hopper |
| Current scaling | High | Good | Per tensor | Medium-sized dense and hybrid models | NVIDIA Hopper |
| Sub-channel scaling | Medium | High | Custom 2D block of 128×128 | MoE models | NVIDIA Hopper and NVIDIA Blackwell |
| MXFP8 | Medium | High | Per 32-value block | All | NVIDIA Blackwell and NVIDIA Grace Blackwell |
Scaling recipe granularity
Figure 1 shows measured FP8 general matrix multiplication (GEMM) throughput speedup over BF16 for various scaling approaches on NVIDIA H100. Hardware-native scaling (channel-wise, subchannel-wise, tensor-wise) achieves up to 2x acceleration, underscoring why FP8 is so effective at the hardware level.
While FP8 offers significant speedups over BF16, the choice of scaling granularity (that is, how finely scaling factors are applied within a tensor) introduces nuanced trade-offs in actual performance, particularly for GEMM operations. Finer granularity, while useful for numerical stability and accuracy because it better accommodates intra-tensor variability, can introduce additional overhead that impacts raw throughput.
Figure 1. Measured FP8 GEMM throughput speedup over BF16 for different scaling granularities on NVIDIA H100, as a function of GEMM size (K)
A clear hierarchy in performance is observed when comparing scaling granularities for GEMM operations. Tensor-wise scaling generally demonstrates the highest speedup: with only a single scaling factor per tensor involved in the GEMM, the overhead associated with scale management is minimized.
Channel-wise scaling represents an intermediate level of granularity, typically applying a scaling factor per channel or per row/column. As seen in the figure, its speedup falls between tensor-wise and 2D block-wise methods.
Sub-channel-wise (2Dx2D) scaling (for example, 1×128 blocks for activations and 128×128 blocks for weights) represents a finer granularity and generally exhibits slightly lower speedups compared to tensor-wise scaling. Managing multiple scaling factors for the many smaller blocks within a tensor introduces a computational cost that, while valuable for accuracy, can reduce peak raw throughput. This holds true for other configurable block dimensions such as 1Dx1D or 1Dx2D, where finer block divisions mean more scales to process per GEMM.
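To make the granularity trade-off concrete, the following sketch (our own illustration, not code from Transformer Engine) computes FP8 scaling factors at per-tensor and 128×128 block granularity. The block variant follows local dynamic range more faithfully, but it produces hundreds of scales that must be carried through each GEMM instead of just one.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

def per_tensor_scale(x: torch.Tensor) -> torch.Tensor:
    # One scale for the whole tensor: minimal bookkeeping per GEMM.
    return FP8_E4M3_MAX / x.abs().amax().clamp(min=1e-12)

def per_block_scales(x: torch.Tensor, block: int = 128) -> torch.Tensor:
    # One scale per (block x block) tile: tracks local dynamic range better,
    # but yields (rows/block) * (cols/block) scales to manage per GEMM.
    rows, cols = x.shape
    tiles = x.reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3)).clamp(min=1e-12)
    return FP8_E4M3_MAX / amax

w = torch.randn(4096, 4096)
print(per_tensor_scale(w).shape)       # torch.Size([])      -> 1 scale
print(per_block_scales(w, 128).shape)  # torch.Size([32, 32]) -> 1,024 scales
```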
Crucially, the x-axis in Figure 1 highlights the impact of GEMM size. As K increases (meaning larger GEMM operations), the overall speedup of FP8 over BF16 generally improves across all scaling methods. This is because for larger GEMMs, the computational savings from using 8-bit precision become more dominant, outweighing the relative overhead of managing scaling factors. In essence, larger GEMMs allow the inherent advantages of FP8 compute to shine through more effectively, even with the added complexity of finer-grained scaling.
While hardware-native solutions like MXFP8 are designed to mitigate the overhead of block scaling through dedicated Tensor Core acceleration, for general FP8 block scaling implementations, the trade-off between granularity (for accuracy) and raw performance remains a key consideration.
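If you want to reproduce the shape of this trend on your own hardware, a rough micro-benchmark along these lines can help. This is a sketch, assuming Transformer Engine is installed and an H100-class GPU is available; it times only the forward GEMM of a single layer, so absolute numbers will differ from the end-to-end training speedups reported later in this post.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

def time_ms(fn, iters=50):
    # Average wall-clock time per call using CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(5):  # warmup
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=1024, amax_compute_algo="max")

for k in (2048, 4096, 8192, 16384):
    layer = te.Linear(k, k, bias=False).cuda().to(torch.bfloat16)
    x = torch.randn(8192, k, device="cuda", dtype=torch.bfloat16)

    bf16_ms = time_ms(lambda: layer(x))

    def fp8_step():
        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
            layer(x)

    fp8_ms = time_ms(fp8_step)
    print(f"K={k}: BF16 {bf16_ms:.2f} ms, FP8 {fp8_ms:.2f} ms, speedup {bf16_ms / fp8_ms:.2f}x")
```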
Beyond raw speedup, a critical aspect of low-precision training is convergence—how well the model learns and reduces its loss, and ultimately, how it performs on specific downstream tasks. While training loss provides useful insight into the learning process, it's important to remember that it's not the sole metric for FP8 efficacy; robust downstream evaluation metrics are the ultimate arbiters of a model's quality.
Figure 2. Training loss trajectories for FP8 block-wise and per-tensor scaling compared with the BF16 baseline
When adopting FP8, the expectation is that the training loss trajectory should closely mirror that of a higher-precision baseline, such as BF16, to ensure that the model is learning effectively without significant degradation. Figure 2 shows the training loss trajectories for different scaling strategies relative to BF16. The pink line represents the BF16 baseline. Notably, the dark purple line, representing FP8 block-wise scaling, consistently follows a trajectory very similar to BF16. This close alignment indicates that, with finer granularity, block-wise scaling can preserve numerical fidelity more effectively, leading to convergence behavior that closely matches higher-precision BF16 training.
Conversely, the light green line, representing FP8 per-tensor scaling, occasionally shows slight deviations or larger fluctuations in loss. This subtle difference in convergence trajectory highlights the trade-off inherent in granularity: while coarser-grained per-tensor scaling might offer higher raw GEMM throughput, as discussed previously, finer-grained block-wise scaling tends to yield less accuracy loss and a more stable learning path that closely mirrors BF16.
This illustrates the crucial balance between speedup and numerical stability in FP8 training. More granular scaling methods, by better accommodating the varying dynamic ranges within tensors, can produce convergence trajectories that more faithfully track higher-precision baselines, though this may come with a corresponding difference in speed compared to less granular approaches. The optimal choice often involves weighing the demands of downstream evaluation against available computational resources and desired training speed.
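A simple way to quantify how closely an FP8 run tracks its BF16 baseline is to compare the two logged loss curves directly. The snippet below is a minimal sketch; the file names are hypothetical placeholders for wherever your training jobs log per-step loss.

```python
# Sanity-check FP8 convergence against a BF16 baseline: given two loss curves
# logged at the same steps, report the mean and worst-case relative deviation.
import numpy as np

bf16_loss = np.loadtxt("bf16_loss.txt")  # hypothetical file: one loss value per step
fp8_loss = np.loadtxt("fp8_loss.txt")    # hypothetical file: same logging cadence

steps = min(len(bf16_loss), len(fp8_loss))
rel_dev = np.abs(fp8_loss[:steps] - bf16_loss[:steps]) / np.abs(bf16_loss[:steps])

print(f"mean relative deviation:  {rel_dev.mean():.4%}")
print(f"worst relative deviation: {rel_dev.max():.4%}")
# A trajectory that stays close to the baseline (small deviations that do not grow
# over training) is the behavior block-wise scaling exhibits in Figure 2.
```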
Experimental setup
All experiments in this post were conducted using NVIDIA NeMo Framework 25.04, the latest release of the NeMo Framework at the time of writing. NeMo Framework 25.04 provides robust, production-grade support for FP8 training through the NVIDIA Transformer Engine (TE) and includes out-of-the-box recipes for dense architectures.
We evaluated two leading FP8 approaches: the current scaling recipe on H100 GPUs and the MXFP8 recipe on the newer NVIDIA DGX B200 architecture. For both, we tested a range of state-of-the-art models, including Llama 3 8B, Llama 3 70B, Llama 3.1 405B, Nemotron 15B, and Nemotron 340B. Each setup was compared directly against a BF16 baseline to measure the practical speedup delivered by FP8 in real-world training scenarios.
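For reference, the sketch below shows roughly how FP8 can be switched on for a stock NeMo 2.0 pretraining recipe. It assumes the NeMo 2.0 recipe API and NeMo-Run; the MegatronMixedPrecision argument names follow the conventions we are aware of but may differ between releases, so verify them against your installed NeMo Framework version rather than treating this as a definitive configuration.

```python
from nemo.collections import llm
from nemo import lightning as nl
import nemo_run as run

# Start from a stock dense-model recipe (Llama 3 8B here, as used in our benchmarks).
recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_fp8",
    num_nodes=1,
    num_gpus_per_node=8,
)

# Swap the mixed-precision plugin for one with FP8 enabled. fp8="hybrid" selects
# E4M3 forward / E5M2 gradient tensors; the scaling recipe itself (delayed, current,
# MXFP8) is controlled by additional fields in recent releases -- check the
# MegatronMixedPrecision signature shipped with your NeMo version.
recipe.trainer.plugins = run.Config(
    nl.MegatronMixedPrecision,
    precision="bf16-mixed",
    fp8="hybrid",
    fp8_amax_history_len=1024,
    fp8_amax_compute_algo="max",
)

if __name__ == "__main__":
    run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8))
```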
Current scaling recipe
As illustrated in Figure 3, the current scaling FP8 recipe on H100 GPUs demonstrates a pronounced, model-size-dependent speedup compared to the BF16 baseline. For smaller models such as Llama 3 8B, the speedup is roughly 1.30x.
This advantage becomes even more significant with larger architectures. For instance, the Llama 3 70B model achieves a speedup of 1.43x, and the largest model in our benchmark suite, Llama 3.1 405B, reaches an impressive 1.53x acceleration.
Figure 3. Speedup of the FP8 current scaling recipe over BF16 on H100 GPUs for Llama 3 8B, Llama 3 70B, and Llama 3.1 405B
This upward trend is not just a statistical curiosity; it underscores a fundamental advantage of FP8 training for large-scale language models. As model size and computational complexity increase, the efficiency gains from reduced-precision arithmetic become more pronounced.
The reason is twofold. First, larger models naturally involve more matrix multiplications and data movement, both of which benefit substantially from the reduced memory footprint and higher throughput of FP8 on modern hardware. Second, the overheads associated with scaling and dynamic-range adjustments become relatively less significant as total computation grows, allowing the raw performance advantages of FP8 to dominate.
MXFP8 recipe
Figure 4 shows the performance of the MXFP8 recipe on DGX B200 GPUs, revealing a consistent speedup over BF16 across different model sizes, with observed gains ranging from 1.28x to 1.37x. While these absolute speedup values are slightly lower than those achieved by the current scaling recipe, they are notable for their stability and reliability across a diverse set of models.
Figure 4. Speedup of the MXFP8 recipe over BF16 on DGX B200 GPUs across model sizes
The relative flatness in speedup from 8B to 70B parameters, contrasted with the larger jump at 340B, reflects how block-based scaling interacts with model and hardware characteristics. MXFP8 assigns a shared scaling factor to each 32-element block, which can introduce additional memory-access overhead for mid-sized models. However, as model size increases and computation becomes the dominant bottleneck (as seen with Nemotron 340B), the efficiency advantages of block-wise FP8 become more pronounced, leading to the observed peak speedup.
These results highlight the architectural strengths of the NVIDIA Blackwell (B200) platform, which was purpose-built to maximize efficiency for lower-precision formats like FP8 and, specifically, for block-based scaling approaches such as MXFP8. The B200 Tensor Cores and advanced memory hierarchy are optimized for these microscaling formats, enabling high throughput, efficient memory utilization, and stable convergence even as models scale into the hundreds of billions of parameters. With MXFP8, each block of 32 values shares a scaling factor, striking a balance between dynamic range and computational efficiency. This approach delivers reliable acceleration while minimizing the risk of numerical instability, a key consideration when pushing models to ever-larger scales.
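To make the 32-value block idea concrete, here is a small numeric sketch (our own illustration; actual MXFP8 quantization is performed by Transformer Engine and Blackwell Tensor Cores, not user code) that picks a power-of-two scale per 32-element block and maps the block into E4M3 range.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def mxfp8_block_scales(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Illustrative only: one power-of-two (E8M0-style) scale per 32-element block."""
    blocks = x.reshape(-1, block)
    amax = blocks.abs().amax(dim=1).clamp(min=2.0 ** -127)
    # One simple power-of-two choice that keeps x / scale within E4M3 range;
    # the OCP MX spec defines the exact exponent-selection rule used in hardware.
    exp = torch.ceil(torch.log2(amax / FP8_E4M3_MAX))
    return torch.exp2(exp)

x = torch.randn(4, 1024)
scales = mxfp8_block_scales(x)          # shape (128,): one scale per 32-value block
q = (x.reshape(-1, 32) / scales[:, None]).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
print(scales.shape, q.abs().max())      # scaled values now fit the E4M3 range
```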
How does NVIDIA GB200 Grace Blackwell Superchip compare to NVIDIA Blackwell architecture?
The comparison between GB200 and B200 highlights how architectural integration and system design can translate into tangible performance gains for large-scale AI workloads. Both are built on the NVIDIA Blackwell architecture, but the GB200 superchip combines two B200 GPUs with a Grace CPU, interconnected through NVIDIA NVLink, resulting in a unified memory domain and exceptionally high memory bandwidth.
Figure 5. FP8 training performance comparison of the NVIDIA GB200 Grace Blackwell Superchip and NVIDIA B200
Get started with practical FP8 training
A clear pattern emerges from these benchmarks: for dense models, the larger the model, the greater the speedup with FP8. This is because as model size increases, the number of matrix multiplications (GEMMs) grows rapidly, and these operations benefit most from the reduced precision and higher throughput of FP8. In large dense models, FP8 enables dramatic efficiency gains, making it possible to train and fine-tune ever-larger language models with less time and compute.
These empirical results reinforce the particular strengths and trade-offs of each FP8 scaling recipe detailed in this post and demonstrate that both per-tensor and MXFP8 approaches deliver significant speedup and convergence advantages over BF16.
Ready to try these techniques yourself? Explore the FP8 recipes to get started with practical FP8 training configurations and code.
