In LLM training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is difficult. EP communication is logically an all-to-all exchange, but because of its dynamic and sparse nature (each token activates only its top-k experts instead of all experts), it is difficult to implement and optimize.
This post details an efficient MoE EP communication solution, Hybrid-EP, and its use within the NVIDIA Megatron family of frameworks, on NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet platforms. It also dives into the effectiveness of Hybrid-EP in real-world model training.
Efficiency challenges of hyperscale MoE model training
DeepSeek-V3 is representative of the new generation of large-scale, fine-grained MoE models. Such models balance computational overhead against model quality through large parameter scale with sparse activation, but they also pose serious challenges for existing large-model training frameworks.
- Communication efficiency bottlenecks: MoE models rely on expert parallelism and require frequent all-to-all communication. As the number of experts increases, the burden of EP communication grows. In DeepSeek-V3, communication time can account for more than 50% of overall training time without optimization.
- Load imbalance: Dynamic routing causes some "hot" experts to receive more tokens than average while "cold" experts are underutilized, leading to uneven compute load across devices and wasted computing power. This problem becomes more pronounced in fine-grained scenarios where the number of experts and the number of activated experts continue to increase.
- Framework adaptability challenges: Today's MoE models impose higher and more complex requirements on parallel strategies, low-precision computing, and dynamic resource scheduling. They also need optimization to maximize the potential of next-generation hardware architectures such as NVIDIA Blackwell, NVIDIA Quantum InfiniBand, and NVIDIA Spectrum-X Ethernet.
MoE training framework optimization and communication solution
NVIDIA Megatron Core—an open source library for large-scale model training—is a key foundation for training hyperscale MoE models. Its core advantages include:
- Multidimensional parallelism strategies, with support for tensor parallelism (TP), sequence parallelism, pipeline parallelism (PP), MoE expert parallelism (EP), and other strategies that can be flexibly combined to accommodate diverse and complex training workloads.
- Resource and efficiency optimization, integrating FP8 mixed-precision training, activation offloading, a distributed optimizer, and fine-grained recomputation to reduce GPU memory consumption and fully support model training. It integrates multiple efficient operators (such as MLA, Attention, and MLP) and provides various fusion optimizations and pipeline scheduling strategies to improve computing performance.
- MoE-specific adaptation, providing complete support for mainstream MoE models such as DeepSeek, Mixtral, and Qwen, with efficient, scalable training.
How Hybrid-EP delivers efficient communication optimization
Hybrid-EP is a newly designed MoE EP communication library. It uses hardware and software advancements on the NVIDIA platform to achieve near-hardware-limit communication bandwidth and minimize GPU hardware resource usage in RDMA-NVLink hybrid network architectures.
It implements the two core operators of MoE EP communication: dispatch, which routes the tokens output by the attention operator to the corresponding experts, and combine, which routes the tokens output by the experts back to the attention operator. Routing and data-processing support is also included to enable complete EP communication.
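As a reference for these semantics only (not Hybrid-EP's actual CUDA kernels), dispatch and combine can be sketched in plain Python; the token values, routing map, and expert count below are illustrative.

```python
# Reference semantics of MoE dispatch/combine (illustrative sketch only).
# Each token is routed to the experts its routing-map row selects; combine
# accumulates the expert outputs back to the owning token.

def dispatch(tokens, routing_map, num_experts):
    """Group each token under every expert its routing row selects."""
    per_expert = [[] for _ in range(num_experts)]
    for tok_id, row in enumerate(routing_map):
        for expert_id, selected in enumerate(row):
            if selected:
                per_expert[expert_id].append((tok_id, tokens[tok_id]))
    return per_expert

def combine(expert_outputs, num_tokens):
    """Accumulate expert outputs back to the owning token."""
    out = [0.0] * num_tokens
    for outputs in expert_outputs:
        for tok_id, value in outputs:
            out[tok_id] += value
    return out

tokens = [1.0, 2.0, 3.0]
routing_map = [          # 3 tokens x 2 experts
    [True, False],
    [True, True],
    [False, True],
]
per_expert = dispatch(tokens, routing_map, num_experts=2)
# Treating each expert as the identity function for this sketch:
combined = combine(per_expert, num_tokens=3)
print(per_expert)  # [[(0, 1.0), (1, 2.0)], [(1, 2.0), (2, 3.0)]]
print(combined)    # [1.0, 4.0, 3.0]
```

Hybrid-EP implements this same logical routing, but with chunked pipelines over NVLink and RDMA rather than per-token loops.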


The design goals and core optimization directions of Hybrid-EP include:
- Leveraging the latest communication technologies on the NVIDIA platform, such as TMA instructions for data movement on NVLink scale-up networks and low-level IBGDA technology for RDMA networks.
- RDMA and NVLink hybrid network communication, which maximizes cross-domain bandwidth by combining intra-node NVLink with inter-node RDMA to improve algorithmic bandwidth.
- A data pipeline that masks most of the latency of communication and dynamic routing by cutting data into fine-grained chunks and streaming them through multiple levels of the communication pipeline, making EP bandwidth comparable to a highly optimized, static all-to-all.
- Minimized GPU streaming multiprocessor (SM) usage to maximize communication-computation overlap. Hybrid-EP achieves peak communication bandwidth with fewer SMs, leaving more SMs available for computation.
- Native low-precision support, with FP8/BF16 dispatch operators and BF16 combine operators.
Hybrid-EP designs each CUDA block as an independent data channel that occupies one SM and runs a complete data pipeline. Different warp groups within a CUDA block handle different pipeline stages. CUDA blocks run in parallel and process different data chunks with no synchronization or communication between blocks.
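The scheduling idea can be illustrated with a toy cycle-by-cycle simulation of a two-stage pipeline (a G2S stage feeding an S2G stage): once the pipeline fills, one chunk completes per step, so most transfer latency is hidden. This is purely a sketch of stage overlap, not the CUDA implementation.

```python
# Toy simulation of a two-stage pipeline. At each step the G2S stage loads
# chunk i while the S2G stage stores chunk i-1, so the stages overlap.

def simulate(chunks):
    schedule = []  # (step, stage, chunk)
    for step in range(len(chunks) + 1):
        if step < len(chunks):
            schedule.append((step, "G2S", chunks[step]))       # load next chunk
        if step >= 1:
            schedule.append((step, "S2G", chunks[step - 1]))   # store previous chunk
    return schedule

for row in simulate(["c0", "c1", "c2"]):
    print(row)
# (0, 'G2S', 'c0')
# (1, 'G2S', 'c1')
# (1, 'S2G', 'c0')
# ... and so on: after step 0, both stages are busy every step.
```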


The dotted boxes in Figure 2 represent pipeline stages used only for RDMA network communication. The RDMA warp group is responsible for submitting network traffic to RDMA network interface cards (NICs) using IBGDA technology, completing network communication and token data transmission between same-rail GPUs on different nodes (such as GPU 0 on every node).
The G2S warp group is responsible for reading the token data owned by the local GPU, along with the token data transmitted by same-rail GPUs on other nodes, into the shared memory first-in, first-out (FIFO) queue inside the SM. The S2G warp group writes the token data from the shared memory FIFO queue inside the SM to the corresponding location in the output buffer of every GPU in the node (including the local GPU).
During this process, tokens are routed and transported according to the information in the routing map, avoiding the transmission of unneeded token data. Each CUDA block uses this data pipeline to process the token data in its assigned data chunks in order. Different CUDA blocks handle different data chunks using the same data pipeline.
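How chunks might be divided among blocks can be sketched as follows; the round-robin assignment and chunk size are illustrative assumptions, not Hybrid-EP's actual scheduling policy.

```python
# Illustrative chunk-to-block assignment: each "block" gets its own set of
# token ranges and processes them independently with the same pipeline.

def assign_chunks(num_tokens, chunk_tokens, num_blocks):
    """Round-robin assignment: block b handles chunks b, b+B, b+2B, ..."""
    num_chunks = (num_tokens + chunk_tokens - 1) // chunk_tokens
    assignment = {b: [] for b in range(num_blocks)}
    for chunk_id in range(num_chunks):
        start = chunk_id * chunk_tokens
        end = min(start + chunk_tokens, num_tokens)
        assignment[chunk_id % num_blocks].append((start, end))
    return assignment

print(assign_chunks(num_tokens=10, chunk_tokens=3, num_blocks=2))
# {0: [(0, 3), (6, 9)], 1: [(3, 6), (9, 10)]}
```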


As with the dispatch operator, the dotted section is only used for RDMA network communication. Because the combine operator performs high-precision accumulation over tokens, which is currently done only on the CUDA cores inside the SM, these accumulations must be performed hierarchically.
In the multi-node case, the relevant intra-node warp groups first complete part of the accumulation for each token within the node, and then the RDMA warp group sends the partially accumulated token to the same-rail GPU on the destination node. Finally, the inter-node warp groups complete the global accumulation to produce the final result.
In the single-node case, the intra-node accumulation is performed directly by the warp groups that would otherwise handle the inter-node stage. During this process, the input token is read by the corresponding G2S warp group into the shared memory G2S FIFO queue inside the SM, and the corresponding reduction warp group then accumulates tokens on the CUDA cores. The result is stored in the shared memory S2G FIFO queue inside the SM and handed to the TMA unit for writing to GPU memory.
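The two-stage accumulation can be summarized numerically; scalar values stand in for token tensors, and the node and GPU counts are illustrative.

```python
# Hierarchical accumulation sketch for the combine operator: ranks inside a
# node first reduce their contributions (NVLink domain), then one partial
# result per node is reduced across nodes (RDMA, same rail).

def combine_hierarchical(contributions_per_node):
    """contributions_per_node: list of nodes, each a list of per-GPU values."""
    # Stage 1: intra-node partial accumulation over NVLink.
    partials = [sum(gpu_vals) for gpu_vals in contributions_per_node]
    # Stage 2: inter-node accumulation of the partial sums over RDMA.
    return sum(partials)

# 2 nodes x 4 GPUs, each contributing a partial expert output for one token:
contributions = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.0, 1.0]]
print(combine_hierarchical(contributions))  # 13.0
```

Doing the intra-node reduction first means only one partially accumulated token per node crosses the RDMA network, instead of one per GPU.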
Hybrid-EP was tested across multiple hardware platforms with the following test conditions:
- HIDDEN_DIM is 8,192
- DATA_TYPE is BF16; only token data is transferred.
- NUM_OF_ATTN_TOKENS_PER_RANK is 4,096. NUM_OF_EXPERTS_PER_RANK is 2.
- The routing map is generated randomly from a uniform distribution.
- TOPK is 8.
- Use the NVIDIA Quantum InfiniBand network.
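Under these conditions, the logical dispatch payload per rank can be estimated as follows (a rough sizing sketch that assumes each top-k copy is transferred separately, with no deduplication of copies destined for the same rank):

```python
# Per-rank dispatch payload implied by the benchmark configuration: each rank
# dispatches 4,096 tokens of 8,192 BF16 values to its top-8 experts.

HIDDEN_DIM = 8192
TOKENS_PER_RANK = 4096
TOPK = 8
BYTES_PER_ELEM = 2  # BF16

bytes_per_rank = TOKENS_PER_RANK * TOPK * HIDDEN_DIM * BYTES_PER_ELEM
print(f"{bytes_per_rank / 2**30:.1f} GiB per rank per dispatch")  # 0.5 GiB per rank per dispatch
```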
The first test was run on an NVIDIA DGX Hopper platform with eight H100 GPUs. Hybrid-EP fills NVLink bandwidth with only eight SMs.


Next, a cluster of four NVIDIA DGX Hopper systems, 8×4 = 32 GPUs in total, was tested. The four DGX H100 GPUs on the same rail each used an NVIDIA ConnectX-7 NIC at 400 Gbps, connected via the NVIDIA Quantum InfiniBand network.
Because Hybrid-EP performs hierarchical communication over the NVLink-RDMA hybrid network, two sets of data were collected during the test:
- NIC bus bandwidth: The actual speed achieved on the ConnectX-7 NIC, calculated from the amount of data passing through it and the total communication time.
- Algorithm bandwidth: The global bandwidth, calculated at the algorithm level, which measures the bandwidth that the dispatch and combine operators achieve across the hybrid network.
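The difference between the two metrics can be made concrete with a small calculation; the byte counts and timing below are made-up illustrative numbers.

```python
# NIC bus bandwidth counts only bytes that actually cross the NIC, while
# algorithm bandwidth divides the full logical payload of dispatch/combine
# (including the intra-node NVLink portion) by the same wall-clock time.

def bus_bandwidth_gbps(nic_bytes, seconds):
    return nic_bytes * 8 / seconds / 1e9

def algo_bandwidth_gbps(logical_bytes, seconds):
    return logical_bytes * 8 / seconds / 1e9

seconds = 0.010
nic_bytes = 0.4e9       # bytes that crossed the NIC (inter-node portion only)
logical_bytes = 1.6e9   # full logical payload, incl. intra-node NVLink traffic
print(f"{bus_bandwidth_gbps(nic_bytes, seconds):.0f} Gbps")   # 320 Gbps
print(f"{algo_bandwidth_gbps(logical_bytes, seconds):.0f} Gbps")  # 1280 Gbps
```

With hierarchical communication, algorithm bandwidth can thus exceed the per-NIC line rate, which is exactly what the hybrid design aims for.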
Hybrid-EP requires only about 4 SMs to approach the NIC’s maximum bandwidth.


Finally, Hybrid-EP performance was tested in a large-scale NVLink network on the NVIDIA Grace Blackwell platform. The NVLink domain size was 36 GPUs, which is a GB200 NVL36 system. Hybrid-EP requires only 16 SMs to fill NVLink bandwidth.


Practical cases: Real-world verification with popular models and hardware
Hybrid-EP is based on templates and CUDA C implementations that take both input and output buffer addresses. Some additional integration work is required to use Hybrid-EP in the PyTorch-based Megatron Core framework. It is now available in the DeepEP/Hybrid-EP branch and provides directly callable PyTorch operators, making it convenient for users to quickly complete integration and testing.
Because the Hybrid-EP kernel only accepts pointer parameters and is not responsible for memory management, a reasonable buffer management and allocation mechanism must be designed. Depending on the usage scenario, Hybrid-EP buffers can be roughly divided into two categories:
- Registered buffer: Specially registered GPU memory that can be accessed by kernels on other ranks. It is a globally unique static buffer. Registration depends on the scenario: cross-node communication registers the GPU memory with the communication memory region, while intra-node communication uses a driver-API handle that other ranks can resolve.
- Normal buffer: GPU memory allocated with cudaMalloc, which can be managed by PyTorch's allocator and is usually not globally unique.
Because buffer allocation and registration are time-consuming, they are ideally completed only during the Hybrid-EP initialization phase in Python. However, the MoE model is dynamic: the number of tokens received by the current rank varies each iteration, changing the required buffer size. To address this, a worst-case preallocation strategy is used, allocating a buffer large enough for the upper limit in which all tokens converge to the same rank. Because this buffer is globally unique, overall GPU memory usage remains controllable.
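A worst-case sizing sketch follows, using illustrative parameter names rather than Hybrid-EP's actual API. The bound assumes every rank routes all of its top-k token copies to this rank; an implementation that deduplicates copies destined for the same rank could divide out the top-k factor.

```python
# Worst-case preallocation sketch: size the registered buffer so that every
# token in the EP group could land on this rank. Names are illustrative.

def worst_case_buffer_bytes(ep_size, tokens_per_rank, topk, hidden_dim,
                            bytes_per_elem):
    # Upper bound: all ranks route all their top-k copies to this rank.
    max_tokens_here = ep_size * tokens_per_rank * topk
    return max_tokens_here * hidden_dim * bytes_per_elem

# EP=32, 4,096 tokens/rank, top-8, hidden 8,192, BF16:
size = worst_case_buffer_bytes(32, 4096, 8, 8192, 2)
print(f"{size / 2**30:.0f} GiB")  # 16 GiB
```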






In the PyTorch environment, Hybrid-EP's workflow is shown in Figure 10. After preprocessing, synchronization is required because Torch needs the GPU-side results to determine subsequent tensor sizes, while Hybrid-EP computes buffer sizes in the preprocessing kernel. This sync can be avoided if the host predefines a sufficiently large buffer size.
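The trade-off can be sketched as a host-side decision; the function and field names below are hypothetical, not part of the Hybrid-EP API.

```python
# Sketch of the synchronization trade-off: if the host preallocates a
# sufficiently large output buffer, tensor sizes are known host-side and the
# device-to-host sync after preprocessing can be skipped.

def plan_output_allocation(use_static_max_buffer, max_tokens):
    if use_static_max_buffer:
        # Host-side decision only: no D2H copy, no stream sync needed.
        return {"tokens": max_tokens, "sync_required": False}
    # Otherwise the exact per-iteration token count must come back from the
    # preprocessing kernel, forcing a device-to-host synchronization.
    return {"tokens": None, "sync_required": True}

print(plan_output_allocation(True, max_tokens=131072))
# {'tokens': 131072, 'sync_required': False}
```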


Optimization practices on Grace Blackwell
Megatron Core has integrated Hybrid-EP on the Grace Blackwell platform, and it can be optimized for different types of MoE models.
| Model | Precision | Dispatcher | TFLOPS/GPU | Speedup |
|---|---|---|---|---|
| DeepSeek-V3 | MXFP8 | DeepEP | 829 | 1x |
| DeepSeek-V3 | MXFP8 | Hybrid-EP | 943 | 1.14x |
| DeepSeek-V3 FSDP | MXFP8 | A2A | 597 | 1x |
| DeepSeek-V3 FSDP | MXFP8 | Hybrid-EP | 645 | 1.08x |
| Qwen 3 235B | BF16 | A2A | 665 | 1x |
| Qwen 3 235B | BF16 | Hybrid-EP | 698 | 1.05x |
| Qwen 3 235B | MXFP8 | A2A | 728 | 1x |
| Qwen 3 235B | MXFP8 | Hybrid-EP | 800 | 1.10x |
The results show:
- In the DeepSeek-V3 scenario (256 experts, top-8), Hybrid-EP achieves about a 14% performance improvement over DeepEP, without MTP.
- With Megatron-FSDP, Hybrid-EP still delivers about an 8% performance improvement.
- In the Qwen 3 235B scenario, there is a 5.5% improvement with BF16 and about a 9.9% improvement with MXFP8.
Learn more about how NVIDIA is enabling 10x performance and 1/10 cost for deploying MoE models.
