Training massive mixture-of-experts (MoE) models has long been the domain of a few advanced users with deep infrastructure and distributed-systems expertise. For many developers, the challenge wasn't building smarter models; it was scaling them efficiently across hundreds or even thousands of GPUs without breaking the bank.
With NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, developers can now train large-scale MoE models directly in PyTorch, using the same familiar tools they already know. Built on accelerated PyTorch distributed with NVIDIA performance optimizations, NeMo Automodel democratizes large-scale MoE training, making it:
- Easy – Train billion-parameter models directly in PyTorch without managing complex parallelism or specialized systems.
- Accessible – Empower researchers, startups, and enterprises to experiment with MoE architectures previously out of reach.
- Efficient – Scale from eight to over 1,000 GPUs while maintaining strong performance and cost-effectiveness through built-in optimizations.
In this post, you'll see how NeMo Automodel combines PyTorch-native distributed parallelism with NVIDIA acceleration to make large-scale MoE training easier, faster, and more accessible than ever. You'll also find a detailed quick-start guide to reproduce the benchmark results, run your own experiments, and explore configuration options, so you can experience the benefits firsthand.
Why training large MoEs is tough
Training MoEs efficiently at scale requires solving several interconnected challenges:
- Expert parallelism: Distribute hundreds of experts across GPUs without overwhelming communication bandwidth.
- Token routing overhead: Move tokens quickly and efficiently to the right experts.
- Memory management: Shard massive parameter sets to fit within GPU memory constraints.
- Communication-computation fusion: Minimize latency from all-to-all communication and token permutation operations.
As a result of these system challenges, achieving more than 150 TFLOPs/GPU on H100 systems at BF16 precision has historically been difficult, leaving performance untapped.
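To make the routing challenge concrete, here is a minimal sketch of top-k expert routing in plain PyTorch. It is illustrative only: the names and shapes are assumptions, not NeMo Automodel APIs, and the distributed dispatch step is only noted in a comment.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden_states: torch.Tensor, router_weight: torch.Tensor, top_k: int = 2):
    """Toy top-k MoE router: pick top_k experts per token and normalize their gates."""
    # Score every token against every expert: [num_tokens, num_experts].
    router_logits = hidden_states @ router_weight
    gate_probs = F.softmax(router_logits, dim=-1)

    # Keep only the top_k experts per token and renormalize their weights.
    top_probs, top_experts = gate_probs.topk(top_k, dim=-1)
    top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)

    # In a real distributed run, tokens would now be permuted and exchanged
    # (all-to-all) so each GPU receives the tokens for the experts it owns.
    return top_experts, top_probs

tokens = torch.randn(16, 1024)        # 16 tokens, hidden size 1024
router = torch.randn(1024, 8)         # 8 experts
experts, gates = route_tokens(tokens, router)
print(experts.shape, gates.shape)     # torch.Size([16, 2]) torch.Size([16, 2])
```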
NVIDIA NeMo Automodel, an open-source library within the NVIDIA NeMo framework, removes these barriers by building on top of native PyTorch parallelisms. It incorporates advanced infrastructure optimizations, previously reserved for expert ML engineers, directly into the PyTorch ecosystem.
Developers can now use PyTorch APIs while achieving over 200 TFLOPs per GPU on H100s with BF16 precision for a variety of popular 100B+ MoE architectures. For example, DeepSeek V3 reached 250 TFLOPs/sec/GPU on 256 GPUs.
This makes large-scale MoE training accessible, empowering the broader community to research, experiment, and innovate with billion-parameter models.
Inside NeMo Automodel: architecture and optimizations
NeMo Automodel bridges PyTorch-native distributed parallelisms with NVIDIA acceleration technologies, creating a unified, high-performance training stack for MoEs.
Scaling efficiently via PyTorch distributed parallelisms
Built on PyTorch distributed, NeMo Automodel seamlessly scales models by composing the parallelism strategies reflected in the benchmark table below:
- Fully sharded data parallelism (FSDP) for sharding parameters, gradients, and optimizer state
- Tensor parallelism (TP) and context parallelism (CP)
- Pipeline parallelism (PP) with virtual pipeline stages (VP)
- Expert parallelism (EP) for distributing MoE experts across GPUs
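As a rough illustration of how these dimensions compose on top of PyTorch distributed, the sketch below builds a 2D DeviceMesh and applies FSDP sharding along the data-parallel axis. The mesh shape and model are placeholders (not the layouts NeMo Automodel generates), and it assumes a recent PyTorch build with the FSDP2 `fully_shard` API, launched via torchrun on 64 GPUs.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2-style API (recent PyTorch)

# Hypothetical 64-GPU layout: 8-way data/FSDP sharding x 8-way expert parallelism.
mesh = init_device_mesh("cuda", (8, 8), mesh_dim_names=("dp", "ep"))

# Stand-in for a transformer/MoE block; expert weights would additionally be
# partitioned along the "ep" mesh dimension by the MoE layers themselves.
block = nn.TransformerEncoderLayer(d_model=1024, nhead=16)

# Shard parameters, gradients, and optimizer state across the "dp" dimension.
fully_shard(block, mesh=mesh["dp"])
```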
Accelerating training with NVIDIA Transformer Engine
Using NVIDIA Transformer Engine kernels, including cuDNN RMSNorm, cuDNN Linear, and DotProductAttention, NeMo Automodel accelerates transformer blocks and supports different attention mechanisms such as multi-head latent attention (MLA), grouped-query attention (GQA), and sliding-window attention (SWA).
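As a hedged sketch of what this looks like at the module level (assuming Transformer Engine is installed; argument names follow our reading of its PyTorch API and may differ across versions), TE modules can stand in for their torch.nn counterparts to pick up the fused kernels:

```python
import torch
import transformer_engine.pytorch as te

hidden, heads = 4096, 32
dtype = torch.bfloat16

# Fused building blocks in place of nn.RMSNorm / nn.Linear / manual attention.
norm = te.RMSNorm(hidden, params_dtype=dtype, device="cuda")
qkv_proj = te.Linear(hidden, 3 * hidden, bias=False, params_dtype=dtype, device="cuda")
attn = te.DotProductAttention(num_attention_heads=heads, kv_channels=hidden // heads)

# TE's default tensor layout is "sbhd": [sequence, batch, heads, head_dim].
x = torch.randn(128, 2, hidden, device="cuda", dtype=dtype)
q, k, v = (t.contiguous() for t in
           qkv_proj(norm(x)).view(128, 2, 3, heads, hidden // heads).unbind(dim=2))
out = attn(q, k, v)  # fused scaled dot-product attention
```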
Smarter expert routing and computation with Megatron-Core DeepEP and GroupedGEMM
To achieve high efficiency at massive scale, NeMo Automodel integrates advanced token routing and expert computation components from Megatron-Core, designed specifically for MoE training.
- DeepEP token dispatcher (Experimental): Scales token routing to 64+ expert-parallelism degrees with highly efficient all-to-all communication and optional permute/unpermute fusion. By leveraging DeepSeek's DeepEP optimization, NeMo Automodel minimizes communication overhead and maintains balanced expert utilization, enabling smoother scaling across hundreds of GPUs.
- GroupedGEMM for MoE experts: Aggregates multiple local expert computations into a single batched GEMM operation, as sketched below. This reduces kernel launch overhead, increases GPU occupancy, and significantly improves throughput and hardware utilization, especially when multiple experts share the same device.
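The benefit is easiest to see next to the naive per-expert loop. The sketch below is a simplified stand-in: it uses torch.bmm and assumes every local expert receives the same number of tokens, whereas the actual Megatron-Core GroupedGEMM handles variable-sized token groups in a single fused kernel.

```python
import torch

num_experts, tokens_per_expert, hidden, ffn = 8, 256, 1024, 4096
x = torch.randn(num_experts, tokens_per_expert, hidden, device="cuda")  # tokens grouped by expert
w = torch.randn(num_experts, hidden, ffn, device="cuda")                # one weight matrix per expert

# Naive: one GEMM kernel launch per local expert.
out_loop = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# Grouped/batched: all local experts computed in a single batched GEMM launch.
out_batched = torch.bmm(x, w)

torch.testing.assert_close(out_loop, out_batched, rtol=1e-3, atol=1e-3)
```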
Breakthrough performance: cost-effective MoE training for everyone
The table below shows pre-training benchmarks on DGX H100 systems with BF16 precision across major MoE architectures:
| Model | #GPUs | GBS (Global Batch Size) | Parallelism [TP, PP, CP, EP, VP, FSDP] | Optimizations | TFLOPs/sec/GPU | Tokens/sec/GPU |
|---|---|---|---|---|---|---|
| DeepSeek V3 671B | 256 | 512 | 1,4,1,64,8,64 | TE + DeepEP | 250 | 1,002 |
| DeepSeek V3 671B | 1024 | 8192 | 1,4,1,64,8,256 | TE + DeepEP | 216 | 865 |
| Kimi K2 | 256 | 512 | 1,8,1,32,4,32 | TE + DeepEP | 189 | 924 |
| Qwen3 MoE 30B | 8 | 512 | 1,1,1,8,-,8 | TE + DeepEP | 277 | 12,040 |
| GPT-OSS 20B | 8 | 256 | 1,1,1,-,-,8 | TE + DeepEP + FlexAttn | 279 | 13,058 |
| GPT-OSS 120B | 64 | 512 | 1,1,1,-,-,64 | TE + DeepEP + FlexAttn | 231 | 7,626 |
NeMo Automodel delivers industry-leading efficiency and scalability across diverse MoE architectures and GPU counts. Models sustain 190 to 280 TFLOPs/sec per GPU and process up to 13,000 tokens/sec per GPU, demonstrating near-linear scaling from eight to 1,024 GPUs, with the DeepSeek V3 671B model reaching 250 TFLOPs/sec per GPU on 256 GPUs. All of this is achieved with native PyTorch parallelisms coupled with NVIDIA optimizations, unlocking peak hardware utilization and cost-effective large-scale MoE training for everyone in the PyTorch community.
Empowering developers through native PyTorch distributed training
By leveraging native PyTorch distributed parallelisms, NeMo Automodel brings high-performance, large-scale MoE training directly into the PyTorch ecosystem. This approach eliminates dependency on external or proprietary model-parallel libraries, giving developers full flexibility to scale using tools and APIs they already know.
Most importantly, it reflects NVIDIA's commitment to strengthening PyTorch and the broader open-source AI ecosystem, making large-model training not only faster, but also more open, interoperable, and accessible to the entire developer community.
Key advantages for developers:
- Faster iteration cycles: Achieve higher throughput for quicker experimentation and model development.
- Lower training costs: Higher GPU utilization means fewer GPU-hours per training run.
- Scalable performance: Consistent, near-linear scaling from eight GPUs to over 1,000 GPUs enables flexible infrastructure planning.
- Native PyTorch integration: Leverages PyTorch distributed to remove reliance on external model-parallel frameworks, keeping everything within the PyTorch workflow.
- Ecosystem commitment: Demonstrates NVIDIA's long-term investment in advancing PyTorch, ensuring future innovations are directly integrated into the core framework.
- Production-ready: Includes proven, battle-tested configurations for leading open-source MoE architectures.
Quick start: train and benchmark large MoE models
Getting started with NeMo Automodel is fast and familiar for any PyTorch developer.
You can use the provided benchmark scripts and configuration files to reproduce the results or train your own large-scale MoE models with NVIDIA-optimized performance.
Minimum requirements
At least eight GPUs (80 GB of memory each) are recommended to reproduce the benchmark results and run fine-tuning experiments efficiently.
Follow these steps to run a benchmark or fine-tuning experiment:
```bash
# 1. Pull the NeMo Docker image and start a container
docker pull nvcr.io/nvidia/nemo:25.09
docker run -it -d --ulimit memlock=-1 --ulimit stack=67108864 --gpus all nvcr.io/nvidia/nemo:25.09 bash

# 2. Once inside the container, clone the repo and navigate to Automodel
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```
Run a benchmark
Example: Benchmark Qwen3 MoE 30B on eight GPUs
```bash
torchrun --nproc-per-node 8 nemo_automodel/recipes/llm/benchmark.py \
    --config examples/benchmark/configs/qwen3_moe_30b_te_deepep.yaml
```
Run fine-tuning
Example: Fine-tune Qwen3 MoE 30B
Note:
- You'll need to download the model checkpoint from Hugging Face first:
```bash
hf download Qwen/Qwen3-30B-A3B
```
- If you encounter a dataset instantiation error, upgrade the datasets library:
```bash
pip install --upgrade datasets
```
Then launch the fine-tuning run:
```bash
torchrun --nproc-per-node 8 examples/llm_finetune/finetune.py \
    --config examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml
```
Available configuration files:
- `deepseek_v3_te_deepep.yaml` – DeepSeek V3 (671B parameters)
- `kimi_k2_te_deepep.yaml` – Optimized configuration for Kimi K2
- `qwen3_moe_30b_te_deepep.yaml` – Qwen3 MoE 30B with full NVIDIA optimizations
- `gptoss_20b_te_deepep.yaml` – GPT-OSS 20B with FlexAttention
- `gptoss_120b_te_deepep.yaml` – GPT-OSS 120B production configuration
Check out the docs for complete performance documentation and implementation details.
Looking ahead: Join us in advancing open MoE training
This release marks a significant milestone in democratizing large-scale mixture-of-experts (MoE) training with accelerated PyTorch. But it's only the beginning.
We’re actively working on:
- Expanding model support: Adding new MoE and hybrid architectures.
- Deeper optimizations: Further kernel-level and communication improvements for even higher efficiency.
- Technical deep dives: Detailed explainers of NeMo AutoModel MoE design and performance techniques.
- Broader benchmarking: Extending performance validation across diverse hardware and cluster configurations.
We'd love for you to get started with NeMo Automodel and be part of this journey: try the configurations, share your results, and contribute feedback through GitHub Issues. Your insights help shape the next generation of scalable, open AI training tools.
