Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy



As AI models and datasets continue to grow, relying only on higher-precision BF16 training is no longer sufficient. Key challenges such as training throughput expectations, memory limits, and rising costs are becoming the primary barriers to scaling transformer models.

Using lower-precision training can address these challenges. By reducing the numeric precision used during computation, GPUs can process more operations per cycle, enhancing training efficiency and lowering costs. 

This post compares the following three low-precision training formats directly against established BF16 precision training across multi-hundred-billion-token pretraining runs and downstream benchmarks:

  • FP8 with per-tensor current scaling (FP8-CS)
  • MXFP8
  • NVFP4

We present practical, large-scale results showing how low-precision training delivers up to ~1.6x higher throughput, substantial memory savings, and near-identical model quality using production-ready recipes you can adopt today.

What is low-precision training?

Low-precision training uses numerical formats with fewer bits to represent weights and activations during model training. This reduces memory bandwidth and computational demand, enabling GPUs to process more operations per cycle and significantly increase training throughput.
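The core trade-off can be sketched with simple symmetric integer quantization. This is only an illustrative stand-in: the real FP8/FP4 formats are floating-point, not integer grids, but the effect is the same, as fewer bits mean a coarser grid and a larger, though often tolerable, rounding error.

```python
import numpy as np

def quantize_per_tensor(x, num_bits):
    """Round x onto a symmetric integer grid with one scaling factor for
    the whole tensor, then dequantize back to float."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(x).max() / qmax          # per-tensor scaling factor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                        # values actually used in compute

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

for bits in (8, 4):
    err = np.abs(w - quantize_per_tensor(w, bits)).mean()
    print(f"{bits}-bit mean absolute error: {err:.4f}")
```

Running this shows the 4-bit grid incurring noticeably more rounding error than the 8-bit grid, which is why the finer-grained scaling strategies described next matter.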

Low-precision formats

FP8-CS applies FP8 to linear layers using scaling factors derived from the statistical properties of each tensor at the current training step. MXFP8 extends the FP8 approach with block-level scaling optimized for the NVIDIA Blackwell architecture, with each block covering 32 tensor elements. NVFP4 further improves memory efficiency and throughput by using a 4-bit format for tensor values with a hierarchical two-level scaling strategy.
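To build intuition for why hierarchical scaling helps at 4-bit precision, the toy sketch below (integer grids standing in for the real floating-point formats) compares a single per-tensor scale against NVFP4-style two-level scaling: a coarse per-tensor scale combined with a fine scale per 16-element block.

```python
import numpy as np

QMAX = 7  # largest magnitude representable on a signed 4-bit grid

def quantize_per_tensor4(x):
    """One scaling factor for the whole tensor (tensor-level granularity)."""
    scale = np.abs(x).max() / QMAX
    return np.clip(np.round(x / scale), -QMAX, QMAX) * scale

def quantize_two_level4(x, block=16):
    """Coarse per-tensor scale plus a fine scale per 16-element block,
    mimicking NVFP4's hierarchical scaling (toy integer version)."""
    tensor_scale = np.abs(x).max()                    # level 1: whole tensor
    blocks = (x / tensor_scale).reshape(-1, block)
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # level 2
    q = np.clip(np.round(blocks / block_scale * QMAX), -QMAX, QMAX)
    return (q / QMAX * block_scale * tensor_scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
err_tensor = np.abs(w - quantize_per_tensor4(w)).mean()
err_block = np.abs(w - quantize_two_level4(w)).mean()
print(f"per-tensor 4-bit error: {err_tensor:.4f}, two-level 4-bit error: {err_block:.4f}")
```

Because each block's fine scale adapts to its local dynamic range, the two-level scheme spends its 4 bits more efficiently and produces lower quantization error than a single tensor-wide scale.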

Diagram comparing FP8, MXFP8, and NVFP4 precision formats, illustrating differences in bit width, scaling granularity (tensor-level versus block-level), and the use of hierarchical scaling to balance numerical accuracy and performance.
Figure 1. Comparison of FP8, MXFP8, and NVFP4 low-precision formats. E stands for exponent and M for mantissa in the numerical representation

Can low-precision training match BF16 accuracy at scale? 

To validate the practical impact of low-precision training for real-world large-model pretraining, we evaluated both training convergence and downstream task performance across two widely used dense transformer architectures: Llama 3 8B and an internal NVIDIA research 8B model (Research-8B, a dense grouped query attention (GQA) architecture similar to Llama 3 8B). The models were trained on 1 trillion tokens.

Experimental setup: Isolating the impact of precision

The following large-scale pretraining experiments were run:

  • Four numeric precisions: BF16 (baseline), FP8-CS, MXFP8, and NVFP4
  • Two model architectures: Llama 3 8B and Research-8B
  • Training software and hardware: NeMo Megatron Bridge on NVIDIA B200 GPUs
  • Two datasets: the Lingua DCLM dataset and an internal NVIDIA research dataset. Llama 3 8B was trained on both datasets; Research-8B was trained on the internal dataset only

Convergence behavior: Training stability across precisions

Figures 2, 3, and 4 show training and validation loss curves for both models and datasets. Low-precision training closely tracks the BF16 baseline, demonstrating stable and consistent convergence across precisions. In all cases, NVFP4 shows slightly higher loss, but downstream accuracies remain unaffected. See Table 1 for more details.

Two side-by-side graphs comparing training and validation loss over training steps for the Llama 3 8B model trained on the Lingua DCLM dataset using BF16, FP8-CS, MXFP8, and NVFP4. BF16, FP8-CS, and MXFP8 curves largely overlap, while the NVFP4 curve is slightly higher but follows the same stable downward trend, indicating convergent training behavior across all precisions.
Figure 2. Training and validation loss for Llama 3 8B trained on the Lingua DCLM dataset across BF16, FP8-CS, MXFP8, and NVFP4
Two side-by-side graphs showing training and validation loss over time for the Llama 3 8B model trained on the internal NVIDIA research dataset using BF16, FP8-CS, MXFP8, and NVFP4. Loss curves for BF16, FP8-CS, and MXFP8 closely overlap, while NVFP4 follows a similar stable trend with slightly higher loss, indicating convergent training across all precisions.
Figure 3. Training and validation loss for Llama 3 8B trained on the internal NVIDIA research dataset across BF16, FP8-CS, MXFP8, and NVFP4
Two side-by-side graphs showing training and validation loss over time for the Research-8B model trained on the internal dataset using BF16, FP8-CS, MXFP8, and NVFP4. BF16, FP8-CS, and MXFP8 curves closely align, while NVFP4 exhibits slightly higher loss yet follows the same stable downward trend, indicating consistent convergence across precision formats.
Figure 4. Training and validation loss for Research-8B trained on the internal dataset

Downstream evaluation: Accuracy is preserved

To evaluate whether low-precision training impacts real-world performance, we evaluated all pretrained models on standard downstream benchmarks. All evaluations were run in BF16 precision to isolate the impact of training precision.

Table 1 shows the results. Despite minor differences in training and validation loss, all low-precision formats achieve downstream task accuracy comparable to BF16.

| Model | Dataset | Precision | MMLU (↑) | HellaSwag (↑) | WinoGrande (↑) | ARC-C (↑) |
|---|---|---|---|---|---|---|
| Llama 3 8B | DCLM | BF16 | 45.98 | 76.44 | 70.17 | 51.28 |
| Llama 3 8B | DCLM | FP8-CS | 46.00 | 75.25 | 70.24 | 49.91 |
| Llama 3 8B | DCLM | MXFP8 | 46.56 | 75.46 | 71.27 | 51.11 |
| Llama 3 8B | DCLM | NVFP4 | 45.64 | 75.59 | 69.38 | 51.28 |
| Llama 3 8B | Internal dataset | BF16 | 52.73 | 75.71 | 67.88 | 51.37 |
| Llama 3 8B | Internal dataset | FP8-CS | 52.46 | 75.65 | 70.17 | 54.52 |
| Llama 3 8B | Internal dataset | MXFP8 | 53.70 | 75.54 | 69.69 | 51.62 |
| Llama 3 8B | Internal dataset | NVFP4 | 52.83 | 75.04 | 71.98 | 53.58 |
| Research-8B | Internal dataset | BF16 | 53.00 | 76.98 | 70.40 | 55.89 |
| Research-8B | Internal dataset | FP8-CS | 52.62 | 75.81 | 70.80 | 54.44 |
| Research-8B | Internal dataset | MXFP8 | 52.38 | 76.55 | 69.77 | 53.58 |
| Research-8B | Internal dataset | NVFP4 | 52.21 | 76.19 | 70.32 | 54.95 |
Table 1. Downstream task accuracy (%) for Llama 3 8B and Research-8B across BF16, FP8-CS, MXFP8, and NVFP4 training

Key insights

Key insights from these experiments are detailed below.

  • Low-precision training matches BF16 convergence: FP8-CS, MXFP8, and NVFP4 achieve pretraining and validation losses very close to BF16, showing minimal degradation.
  • Downstream accuracy is preserved: Across all models and benchmarks, low-precision training delivers downstream task accuracy comparable to BF16, demonstrating that reduced precision maintains model effectiveness.
  • MXFP8 performs slightly better than standard FP8: This is likely due to its finer-grained scaling mechanism, which better captures local dynamic range within tensors.
  • NVFP4 with proper calibration delivers competitive results despite aggressive compression: The following recipe is the empirical sweet spot: AdamW ϵ=1e-8, LR=6e-4 → 6e-6, GBS=768.
  • Selective BF16 layers are essential for NVFP4: Ablation studies show that fully NVFP4 models diverge. Stable training requires keeping some layers in BF16, particularly near the end of the network, to mitigate NVFP4 quantization error. In these experiments, keeping the final four transformer layers in BF16 proved sufficient.
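The selective-precision strategy from the ablation can be expressed as a simple per-layer plan. The helper function and config dict below are hypothetical sketches for illustration, not part of any NeMo Megatron Bridge API.

```python
def layer_precision_plan(num_layers, first_bf16=0, last_bf16=4):
    """Keep the first `first_bf16` and last `last_bf16` transformer layers
    in BF16 and the rest in NVFP4 (hypothetical helper, not a library API)."""
    return ["bf16" if i < first_bf16 or i >= num_layers - last_bf16 else "nvfp4"
            for i in range(num_layers)]

# Llama 3 8B has 32 transformer layers; the ablation keeps the last 4 in BF16.
plan = layer_precision_plan(32, first_bf16=0, last_bf16=4)
print(plan.count("nvfp4"), "NVFP4 layers /", plan.count("bf16"), "BF16 layers")

# Hyperparameters from the recipe described above (key names are illustrative):
nvfp4_recipe = {"optimizer": "adamw", "adam_eps": 1e-8,
                "lr": 6e-4, "min_lr": 6e-6, "global_batch_size": 768}
```

This matches the F0L4 configuration used in the throughput measurements below: zero leading and four trailing transformer layers in BF16, everything else in NVFP4.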

Benefits of FP8, MXFP8, and NVFP4 training

Low-precision formats deliver clear gains in both training throughput and memory efficiency, enabling faster end-to-end training and higher scalability on NVIDIA Blackwell GPUs.

| Precision | Micro-batch size | Throughput (TFLOP/s/GPU) | Speedup versus BF16 |
|---|---|---|---|
| BF16 | 2 | 1165 | – |
| FP8-CS (F1L1) | 2 | 1547 | 1.33x |
| MXFP8 | 2 | 1540 | 1.32x |
| NVFP4 (F0L4) | 4 | 1850 | 1.59x |
Table 2. Throughput comparison for Llama 3 8B training on NVIDIA GB200 NVL72 shows up to 1.59x speedup with NVFP4 compared to BF16

GBS=128, sequence length=8192. Note that FxLy denotes that the first ‘x’ and last ‘y’ transformer block layers are kept in BF16 precision

Faster end-to-end training

Using 8-bit or 4-bit numeric formats drastically reduces computational overhead by enabling GPUs to process more operations per clock cycle. Throughput gains can be up to 1.59x over the BF16 baseline (Table 2). These gains translate directly into faster time-to-train for large-scale models.

GPU memory savings and higher scalability

Using lower bit-width formats reduces the memory footprint of weights and activations, allowing larger models or batch sizes on the same hardware. NVFP4 efficiency enables the micro-batch size to double (from 2 to 4) during pretraining, directly improving throughput and scalability.
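A quick back-of-the-envelope calculation shows where the weight-memory savings come from. This counts weight storage only; optimizer state and activations add to the total, and per-block scale factors add a small overhead not modeled here.

```python
# Approximate weight storage for an 8B-parameter model at different
# storage widths (scale-factor overheads ignored for simplicity).
PARAMS = 8e9

def weight_gib(bits_per_param):
    """Weight storage in GiB at the given number of bits per parameter."""
    return PARAMS * bits_per_param / 8 / 2**30

for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{name:6s} {weight_gib(bits):5.1f} GiB")
```

Halving the bit width halves the weight footprint, which is the headroom that lets NVFP4 double the micro-batch size in the runs above.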

Table 3 provides a detailed breakdown of memory usage across training components. Lower-precision formats significantly reduce parameter and activation storage while preserving FP32 optimizer state, enabling higher throughput and larger batch sizes without compromising training stability.

| Precision | Parameter | Gradients | Optimizer momentum | Optimizer variance | Master parameter | Others |
|---|---|---|---|---|---|---|
| FP16 | FP16 | FP32 | FP32 | FP32 | FP32 | – |
| BF16 | BF16 | BF16 | FP32 | FP32 | FP32 | – |
| FP8 (tensor scaling) | FP8x2 | BF16 | FP32 | FP32 | FP32 | Scaling factor per weight tensor |
| MXFP8 | FP8x2 | BF16 | FP32 | FP32 | FP32 | (Scaling factor per 32 elements) x 2 |
| NVFP4 | FP4 | BF16 | FP32 | FP32 | FP32 | 16×16 2D block scales replicated for each 1×16 block |
Table 3. Memory footprint across training components for various precision formats

Low-precision training with NeMo Megatron Bridge

NeMo Megatron Bridge is an open, PyTorch-native library within the NVIDIA NeMo framework. It bi-directionally connects Hugging Face and Megatron Core model checkpoints, and provides the optimized training and multi-node parallelism required to pretrain, SFT, and LoRA-tune generative AI models at maximum throughput.

Adopting low-precision training with the NeMo Megatron Bridge library is straightforward. You can use ready-made low-precision recipes for various models and experiment with different precision formats by changing a single configuration flag. An example for Llama 3 8B is shown below:

from megatron.bridge.recipes.llama import llama3_8b_low_precision_pretrain_config as low_precision_pretrain_config
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

precision = "bf16_with_fp8_current_scaling_mixed"  # must be one of ["bf16_with_mxfp8_mixed", "bf16_with_fp8_current_scaling_mixed", "bf16_with_nvfp4_mixed"]
cfg = low_precision_pretrain_config(
    mixed_precision_recipe = precision,
    train_iters = 100,
    lr_warmup_iters = 10,
    lr_decay_iters = 90,
    mock = True,  # use mock dataset
)
pretrain(config=cfg, forward_step_func=forward_step)

You can easily switch between precision formats to evaluate performance, memory savings, and convergence behavior, without modifying model code or optimizer logic.

Train faster and scale efficiently 

Low-precision training formats like FP8 with current scaling, MXFP8, and NVFP4 offer new avenues for faster, more efficient deep learning training compared to the widely adopted BF16. Their benefits in speed and memory savings open the door to training larger, more complex models. Empirical evidence from Llama 3 8B and internal research models confirms that low-precision training matches BF16 performance on both pretraining metrics and downstream tasks.

Start with low-precision training

As model sizes continue to scale, low-precision training will be foundational to building the next generation of models. With native NVIDIA Blackwell GPU support and production-ready low-precision recipes in NeMo Megatron Bridge, you can try these techniques today.

To get started quickly, try the Megatron Bridge Training Tutorial notebook. It walks through using these low-precision recipes end to end and demonstrates how they can significantly speed up training workloads.


