Training massive mixture-of-experts (MoE) models has long been the domain of a few advanced users with deep infrastructure and distributed-systems expertise. For many developers, the challenge wasn't building smarter models; it was scaling them efficiently across hundreds or even thousands of GPUs without breaking the bank.
With NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, developers can now train large-scale MoE models directly in PyTorch, using the same familiar tools they already know. Built on accelerated PyTorch distributed with NVIDIA performance optimizations, NeMo Automodel democratizes large-scale MoE training, making it:
- Easy – Train billion-parameter models directly in PyTorch without managing complex parallelism or specialized systems.
- Accessible – Empower researchers, startups, and enterprises to experiment with MoE architectures previously out of reach.
- Efficient – Scale from eight to over 1,000 GPUs while maintaining strong performance and cost-effectiveness through built-in optimizations.
In this post, you'll see how NeMo Automodel combines PyTorch-native distributed parallelism with NVIDIA acceleration to make large-scale MoE training easier, faster, and more accessible than ever. You'll also find a detailed quick-start guide to reproduce the benchmark results, run your own experiments, and explore configuration options, so you can experience the benefits firsthand.
Why training large MoEs is tough
Training MoEs efficiently at scale requires solving several interconnected challenges:
- Expert parallelism: Distribute hundreds of experts across GPUs without overwhelming communication bandwidth.
- Token routing overhead: Move tokens quickly and efficiently to the right experts.
- Memory management: Shard massive parameter sets to fit within GPU memory constraints.
- Communication-computation fusion: Minimize latency from all-to-all communication and token permutation operations.
As a result of these system challenges, achieving more than 150 TFLOPs/GPU on H100 systems at BF16 precision has historically been difficult, leaving performance untapped.
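To make the routing challenge concrete, here is a minimal sketch of top-k expert routing in plain PyTorch. It is illustrative only: the names and shapes are assumptions, not NeMo Automodel APIs, and the distributed dispatch step is only noted in a comment.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden_states: torch.Tensor, router_weight: torch.Tensor, top_k: int = 2):
    """Toy top-k MoE router: pick top_k experts per token and normalize their gates."""
    # Score every token against every expert: [num_tokens, num_experts].
    router_logits = hidden_states @ router_weight
    gate_probs = F.softmax(router_logits, dim=-1)

    # Keep only the top_k experts per token and renormalize their weights.
    top_probs, top_experts = gate_probs.topk(top_k, dim=-1)
    top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)

    # In a real distributed run, tokens would now be permuted and exchanged
    # (all-to-all) so each GPU receives the tokens for the experts it owns.
    return top_experts, top_probs

tokens = torch.randn(16, 1024)        # 16 tokens, hidden size 1024
router = torch.randn(1024, 8)         # 8 experts
experts, gates = route_tokens(tokens, router)
print(experts.shape, gates.shape)     # torch.Size([16, 2]) torch.Size([16, 2])
```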
NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, removes these barriers by building on top of native PyTorch parallelisms. It incorporates advanced infrastructure optimizations, previously reserved for expert ML engineers, directly into the PyTorch ecosystem.
Developers can now use PyTorch APIs while achieving over 200 TFLOPs per GPU on H100s with BF16 precision for a variety of popular 100B+ MoE architectures. For example, DeepSeek V3 reached 250 TFLOPs/sec/GPU on 256 GPUs.
This makes large-scale MoE training accessible, empowering the broader community to research, experiment, and innovate with billion-parameter models.
Inside NeMo Automodel: architecture and optimizations
NeMo Automodel bridges PyTorch-native distributed parallelisms with NVIDIA acceleration technologies, creating a unified, high-performance training stack for MoEs.
Scaling efficiently via PyTorch distributed parallelisms
Built on PyTorch distributed, NeMo Automodel seamlessly scales models by composing the parallelism strategies reflected in the benchmark table below:
- Fully sharded data parallelism (FSDP) for sharding parameters, gradients, and optimizer state
- Tensor parallelism (TP) and context parallelism (CP)
- Pipeline parallelism (PP) with virtual pipeline stages (VP)
- Expert parallelism (EP) for distributing MoE experts across GPUs
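As a rough illustration of how these dimensions compose on top of PyTorch distributed, the sketch below builds a 2D DeviceMesh and applies FSDP sharding along the data-parallel axis. The mesh shape and model are placeholders (not the layouts NeMo Automodel generates), and it assumes a recent PyTorch build with the FSDP2 `fully_shard` API, launched via torchrun on 64 GPUs.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2-style API (recent PyTorch)

# Hypothetical 64-GPU layout: 8-way data/FSDP sharding x 8-way expert parallelism.
mesh = init_device_mesh("cuda", (8, 8), mesh_dim_names=("dp", "ep"))

# Stand-in for a transformer/MoE block; expert weights would additionally be
# partitioned along the "ep" mesh dimension by the MoE layers themselves.
block = nn.TransformerEncoderLayer(d_model=1024, nhead=16)

# Shard parameters, gradients, and optimizer state across the "dp" dimension.
fully_shard(block, mesh=mesh["dp"])
```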
Accelerating training with NVIDIA Transformer Engine
Using NVIDIA Transformer Engine kernels, including cuDNN RMSNorm, cuDNN Linear, and DotProductAttention, NeMo Automodel accelerates transformer blocks and supports different attention mechanisms such as multi-head latent attention (MLA), grouped-query attention (GQA), and sliding-window attention (SWA).
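As a hedged sketch of what this looks like at the module level (assuming Transformer Engine is installed; argument names follow our reading of its PyTorch API and may differ across versions), TE modules can stand in for their torch.nn counterparts to pick up the fused kernels:

```python
import torch
import transformer_engine.pytorch as te

hidden, heads = 4096, 32
dtype = torch.bfloat16

# Fused building blocks in place of nn.RMSNorm / nn.Linear / manual attention.
norm = te.RMSNorm(hidden, params_dtype=dtype, device="cuda")
qkv_proj = te.Linear(hidden, 3 * hidden, bias=False, params_dtype=dtype, device="cuda")
attn = te.DotProductAttention(num_attention_heads=heads, kv_channels=hidden // heads)

# TE's default tensor layout is "sbhd": [sequence, batch, heads, head_dim].
x = torch.randn(128, 2, hidden, device="cuda", dtype=dtype)
q, k, v = (t.contiguous() for t in
           qkv_proj(norm(x)).view(128, 2, 3, heads, hidden // heads).unbind(dim=2))
out = attn(q, k, v)  # fused scaled dot-product attention
```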
Smarter expert routing and computation with Megatron-Core DeepEP and GroupedGEMM
To achieve high efficiency at massive scale, NeMo Automodel integrates advanced token routing and expert computation components from Megatron-Core, designed specifically for MoE training.
- DeepEP token dispatcher (Experimental): Scales token routing to 64+ expert-parallelism degrees with highly efficient all-to-all communication and optional permute/unpermute fusion. By leveraging DeepSeek's DeepEP optimization, NeMo Automodel minimizes communication overhead and maintains balanced expert utilization, enabling smoother scaling across hundreds of GPUs.
- GroupedGEMM for MoE experts: Aggregates multiple local expert computations into a single batched GEMM operation, as sketched below. This reduces kernel launch overhead, increases GPU occupancy, and significantly improves throughput and hardware utilization, especially when multiple experts share the same device.
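The benefit is easiest to see next to the naive per-expert loop. The sketch below is a simplified stand-in: it uses torch.bmm and assumes every local expert receives the same number of tokens, whereas the actual Megatron-Core GroupedGEMM handles variable-sized token groups in a single fused kernel.

```python
import torch

num_experts, tokens_per_expert, hidden, ffn = 8, 256, 1024, 4096
x = torch.randn(num_experts, tokens_per_expert, hidden, device="cuda")  # tokens grouped by expert
w = torch.randn(num_experts, hidden, ffn, device="cuda")                # one weight matrix per expert

# Naive: one GEMM kernel launch per local expert.
out_loop = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# Grouped/batched: all local experts computed in a single batched GEMM launch.
out_batched = torch.bmm(x, w)

torch.testing.assert_close(out_loop, out_batched, rtol=1e-3, atol=1e-3)
```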
Breakthrough performance: cost-effective MoE training for everyone
The table below shows pre-training benchmarks on DGX H100 systems with BF16 precision across major MoE architectures:
| Model | #GPUs | GBS (Global Batch Size) | Parallelism [TP, PP, CP, EP, VP, FSDP] | Optimizations | TFLOPs/sec/GPU | Tokens/sec/GPU |
|---|---|---|---|---|---|---|
| DeepSeek V3 671B | 256 | 512 | 1,4,1,64,8,64 | TE + DeepEP | 250 | 1,002 |
| DeepSeek V3 671B | 1024 | 8192 | 1,4,1,64,8,256 | TE + DeepEP | 216 | 865 |
| Kimi K2 | 256 | 512 | 1,8,1,32,4,32 | TE + DeepEP | 189 | 924 |
| Qwen3 MoE 30B | 8 | 512 | 1,1,1,8,-,8 | TE + DeepEP | 277 | 12,040 |
| GPT-OSS 20B | 8 | 256 | 1,1,1,-,-,8 | TE + DeepEP + FlexAttn | 279 | 13,058 |
| GPT-OSS 120B | 64 | 512 | 1,1,1,-,-,64 | TE + DeepEP + FlexAttn | 231 | 7,626 |
NeMo Automodel delivers industry-leading efficiency and scalability across diverse MoE architectures and GPU counts. Models sustain 190 to 280 TFLOPs/sec per GPU and process up to 13,000 tokens/sec per GPU, demonstrating near-linear scaling from eight to 1,024 GPUs, with the DeepSeek V3 671B model reaching 250 TFLOPs/sec per GPU on 256 GPUs. All of this is achieved with native PyTorch parallelisms coupled with NVIDIA optimizations, unlocking peak hardware utilization and cost-effective large-scale MoE training for everyone in the PyTorch community.
Empowering developers through native PyTorch distributed training
By leveraging native PyTorch distributed parallelisms, NeMo Automodel brings high-performance, large-scale MoE training directly into the PyTorch ecosystem. This approach eliminates dependency on external or proprietary model-parallel libraries, giving developers full flexibility to scale using tools and APIs they already know.
Most importantly, it reflects NVIDIA's commitment to strengthening PyTorch and the broader open-source AI ecosystem, making large-model training not only faster, but also more open, interoperable, and accessible to the entire developer community.
Key advantages for developers:
- Faster iteration cycles: Achieve higher throughput for quicker experimentation and model development.
- Lower training costs: Higher GPU utilization means fewer GPU-hours per training run.
- Scalable performance: Consistent, near-linear scaling from eight GPUs to over 1,000 GPUs enables flexible infrastructure planning.
- Native PyTorch integration: Leverages PyTorch distributed to remove reliance on external model-parallel frameworks, keeping everything within the PyTorch workflow.
- Ecosystem commitment: Demonstrates NVIDIA's long-term investment in advancing PyTorch, ensuring future innovations are directly integrated into the core framework.
- Production-ready: Includes proven, battle-tested configurations for leading open-source MoE architectures.
Quick start: train and benchmark large MoE models
Getting started with NeMo Automodel is fast and familiar for any PyTorch developer.
You can use the provided benchmark scripts and configuration files to reproduce the results or train your own large-scale MoE models with NVIDIA-optimized performance.
Minimum requirements
At least eight GPUs (80 GB of memory each) are recommended to reproduce the benchmark results and run fine-tuning experiments efficiently.
Follow these steps to run a benchmark or fine-tuning experiment:
```bash
# 1. Pull the NeMo Docker image and start a container
docker pull nvcr.io/nvidia/nemo:25.09
docker run -it -d --ulimit memlock=-1 --ulimit stack=67108864 --gpus all nvcr.io/nvidia/nemo:25.09 bash

# 2. Once inside the container, clone the repo and navigate to Automodel
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```
Run a benchmark
Example: Benchmark Qwen3 MoE 30B on eight GPUs
```bash
torchrun --nproc-per-node 8 nemo_automodel/recipes/llm/benchmark.py \
    --config examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml
```
Run fine-tuning
Example: Fine-tune Qwen3 MoE 30B
Note:
- You'll need to download the model checkpoint from Hugging Face first:
```bash
hf download Qwen/Qwen3-30B-A3B
```
- If you encounter a dataset instantiation error, upgrade the datasets library:
```bash
pip install --upgrade datasets
```
Then launch the fine-tuning run:
```bash
torchrun --nproc-per-node 8 examples/llm_finetune/finetune.py \
    --config examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml
```
Available configuration files:
- `deepseek_v3_te_deepep.yaml` – DeepSeek V3 (671B parameters)
- `kimi_k2_te_deepep.yaml` – Optimized configuration for Kimi K2
- `qwen3_moe_30b_te_deepep.yaml` – Qwen3 MoE 30B with full NVIDIA optimizations
- `gptoss_20b_te_deepep.yaml` – GPT-OSS 20B with FlexAttention
- `gptoss_120b_te_deepep.yaml` – GPT-OSS 120B production configuration
Check out the docs for complete performance documentation and implementation details.
Looking ahead: Join us in advancing open MoE training
This release marks a significant milestone in democratizing large-scale mixture-of-experts (MoE) training with accelerated PyTorch. But it's only the beginning.
We’re actively working on:
- Expanding model support: Adding new MoE and hybrid architectures.
- Deeper optimizations: Further kernel-level and communication improvements for even higher efficiency.
- Technical deep dives: Detailed explainers of NeMo AutoModel MoE design and performance techniques.
- Broader benchmarking: Extending performance validation across diverse hardware and cluster configurations.
We'd love for you to get started with NeMo Automodel and be part of this journey: try the configurations, share your results, and contribute feedback through GitHub Issues. Your insights help shape the next generation of scalable, open AI training tools.
