Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems

Modern AI workloads have moved well beyond single-GPU inference serving. Model parallelism, which efficiently splits computation across many GPUs, is now the foundation of scalable, state-of-the-art deployments. The best-performing models increasingly adopt mixture-of-experts (MoE) architectures, which are more efficient than dense models because they activate only a subset of trained parameters per token. However, scaling MoEs introduces more complex parallelism, communication, and scheduling requirements that must be rigorously optimized.

Expert parallelism (EP), the strategic distribution of experts across multiple GPUs, is key to overcoming these challenges and unlocking scalable performance. As models like DeepSeek-R1, with 256 experts and 671 billion parameters, continue to grow, new tools are needed, such as NVIDIA TensorRT-LLM's Wide Expert Parallelism (Wide-EP). It makes large-scale deployment more efficient, improving both performance and total cost of ownership.

In this blog, we break down how large-scale EP impacts performance and reshapes inference economics within the NVL72 rack-scale domain.

Expert parallelism (EP) is a model-parallel technique that distributes a MoE model's experts across multiple GPUs to take advantage of their combined compute and memory bandwidth. At smaller scales, EP helps reduce memory pressure and keep utilization high by balancing work across devices.

Figure 1. Animation showing how small-scale EP deploys many experts per GPU, while large-scale EP spreads fewer experts per GPU across a much larger cluster, enabling efficient scaling of MoE layers.

As models like DeepSeek-R1 grow to hundreds of billions of parameters with hundreds of experts, these same techniques must expand in scope, resulting in what we call large-scale EP. For the purposes of this blog, large-scale EP refers to the practice of distributing experts across eight or more GPUs. This increases aggregate bandwidth for faster weight loading and supports larger effective batch sizes to improve overall GPU utilization.
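To make the scaling intuition concrete, here is a small Python sketch (a hypothetical helper, not TensorRT-LLM code) showing how many experts each GPU holds per MoE layer when a DeepSeek-R1-like layer of 256 experts is spread across different EP degrees.

```python
# Hypothetical sketch: experts held by each GPU per MoE layer as the
# expert-parallel (EP) degree grows. Not TensorRT-LLM code.

NUM_EXPERTS = 256  # experts per MoE layer (DeepSeek-R1-like)

def experts_per_gpu(ep_size: int) -> int:
    """Experts held by each GPU for one layer, assuming even division."""
    assert NUM_EXPERTS % ep_size == 0, "choose an EP size that divides the expert count"
    return NUM_EXPERTS // ep_size

for ep in (4, 8, 32, 64):
    print(f"EP={ep:>2}: {experts_per_gpu(ep)} experts per GPU per layer")

# EP= 4: 64 experts per GPU per layer  (small-scale EP: many experts packed per GPU)
# EP=64:  4 experts per GPU per layer  (large-scale EP: few experts, more aggregate bandwidth)
```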

What are the memory and compute challenges of large-scale EP?

MoE models offer the additional benefit of activating only a small subset of experts during inference, significantly reducing the per-token compute requirement. To realize this, MoEs dynamically load the weights of an activated expert on a per-token, per-layer basis. In high-throughput, latency-constrained scenarios, this weight-loading overhead can quickly become a serious bottleneck for a specific type of compute operation called the MoE GroupGEMM.

MoE GroupGEMMs are like sending all tokens to the same checkout lane at the same time, so they can be processed in a single efficient batch. In practice, they are grouped matrix multiplications that batch tokens per expert into a single large calculation. This improves arithmetic intensity, but it requires loading each expert's weights into on-chip memory/registers before multiplication.
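To illustrate the grouping idea, below is a minimal NumPy sketch of what a GroupGEMM conceptually does: tokens routed to the same expert are gathered into one matrix and multiplied against that expert's weights once, so the weights are loaded once per group rather than once per token. The shapes and the per-expert loop are illustrative only; the real fused kernel executes all expert blocks in a single launch.

```python
import numpy as np

# Minimal sketch of the idea behind an MoE GroupGEMM (illustrative, not the fused kernel).
hidden, ffn = 128, 512          # toy dimensions
num_experts, num_tokens = 4, 32

tokens = np.random.randn(num_tokens, hidden).astype(np.float32)
expert_ids = np.random.randint(0, num_experts, size=num_tokens)   # router output
weights = [np.random.randn(hidden, ffn).astype(np.float32) for _ in range(num_experts)]

outputs = np.empty((num_tokens, ffn), dtype=np.float32)
for e in range(num_experts):
    idx = np.where(expert_ids == e)[0]    # gather all tokens routed to expert e
    if idx.size == 0:
        continue
    # One batched matmul per expert: the expert's weights are loaded once
    # and reused across every token in the group.
    outputs[idx] = tokens[idx] @ weights[e]
```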

Figure 2. Tokens routed to the same expert are packed together and processed with a single fused GroupGEMM kernel for efficient MoE inference.

Large-scale EP addresses some of the MoE GroupGEMM bottlenecks by introducing more GPUs into the expert-parallel configuration, effectively reducing the number of experts held by each GPU. This leads to:

  • Less weight-loading pressure (smaller set of expert weights per GPU)
  • Easier reuse of weights by the GroupGEMM kernel (higher arithmetic intensity: more FLOPs per byte of weight loaded; a rough estimate follows this list)
  • Better compute/memory balance inside the kernel
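The arithmetic-intensity point can be made with rough numbers. The sketch below uses toy figures (FP8 weights assumed, activation traffic ignored): FLOPs per byte of expert weight loaded grows linearly with the number of tokens packed onto each expert, which is what larger effective batch sizes per expert provide.

```python
# Rough arithmetic-intensity estimate for one expert GEMM
# (toy figures; FP8 weights assumed, activation traffic ignored).
hidden, ffn = 7168, 2048        # illustrative expert FFN shape
bytes_per_param = 1             # FP8

def flops_per_weight_byte(tokens_per_expert: int) -> float:
    flops = 2 * tokens_per_expert * hidden * ffn       # one grouped matmul
    weight_bytes = hidden * ffn * bytes_per_param      # weights loaded once per group
    return flops / weight_bytes

for t in (4, 32, 256):
    print(f"{t:>3} tokens/expert -> {flops_per_weight_byte(t):6.0f} FLOPs per weight byte")
# More tokens per expert means each loaded weight byte does more useful work.
```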

While large-scale EP helps address the limitations of small-scale EP, it also introduces new system-level constraints that make scaling large MoEs difficult. TensorRT-LLM Wide-EP helps address these constraints by targeting compute and memory bottlenecks algorithmically while also tackling workload management at the system and architecture level.

Let's examine how Wide-EP, when paired with GB200 NVL72, provides the foundation for scalable and efficient MoE inference.

What is the system design and architecture?

Scaling expert parallelism requires more than adding GPUs. It depends on system design and architecture that keep memory movement and communication efficient. Interconnect bandwidth and topology provide the foundation, allowing activations and weights to flow smoothly across devices.

On top of this, optimized software and kernels manage expert-to-expert traffic with communication primitives, bandwidth-aware scheduling, and load balancing. Together, these capabilities make large-scale EP practical and efficient.

One of the biggest bottlenecks in large-scale EP is communication overhead. During the decode phase of inference, distributed experts must exchange information to consolidate the outputs of multiple GPUs across the system. For example, when distributing DeepSeek-R1's 256 experts across 64 GPUs with eight active experts per token (see Figure 3 below), the communication cost depends on which experts are activated at a given layer and where their weights are located.

Figure 3. Schematic diagram showing an MoE deployment with 232 experts per GPU and only 4 activated per layer, coordinated across 72 GPUs in a GB200 NVL72 NVLink domain.

While large-scale EP reduces weight-loading overhead for activated experts, these gains can be offset by token-gather collectives that must consolidate distributed outputs and reorder tokens before passing them to the next transformer block or the final softmax layer. Without the 130 TB/s of aggregate bandwidth provided by the NVL72, the complexity and overhead of this communication pattern would make large-scale EP impractical.
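For a rough sense of scale, the back-of-the-envelope estimate below approximates the dispatch-and-combine traffic generated by a single decoded token. The numbers are illustrative assumptions (hidden size 7168, eight routed experts per token, FP8 activations, the 58 MoE layers of a DeepSeek-R1-style model), and protocol overhead and expert locality are ignored.

```python
# Back-of-the-envelope dispatch/combine traffic per decoded token
# (illustrative assumptions; ignores protocol overhead and expert locality).
hidden_size = 7168        # DeepSeek-R1-like hidden dimension
topk = 8                  # routed (active) experts per token
bytes_per_activation = 1  # FP8 activations assumed
moe_layers = 58           # MoE layers in a DeepSeek-R1-style model

dispatch_bytes = topk * hidden_size * bytes_per_activation   # send token to each routed expert
combine_bytes  = topk * hidden_size * bytes_per_activation   # gather expert outputs back
per_token = (dispatch_bytes + combine_bytes) * moe_layers

print(f"~{per_token / 1e6:.1f} MB moved across the EP domain per token per forward pass")
```

Multiplied by thousands of concurrent users, this is the traffic pattern that the NVL72's aggregate NVLink bandwidth is designed to absorb.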

Optimizing kernels for expert routing with NCCL

MoEs leverage a routing mechanism to dynamically select the most appropriate experts for each token. This means every transformer block requires per-token dispatch and aggregation as tokens pass through the expert layers. The all-to-all operations involved can quickly saturate an already memory-bound decode phase.

To address these challenges, custom EP communication kernels are required. For GB200 NVL72, we have implemented custom kernels to address CUDA graph compatibility across multiple rack-scale deployment scenarios. Of note are custom high-performance NCCL kernels designed to handle non-static data sizes across large-scale EP deployments. These custom EP kernels are able to accept communication sizes directly from GPU memory and take advantage of the NVL72 aggregate memory.
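Conceptually, the routing step is a small gating matmul followed by a top-k selection per token, and the resulting per-expert token counts are exactly the non-static data sizes those kernels must handle. The NumPy sketch below shows the shape of that computation (illustrative only; it omits the actual all-to-all dispatch and combine, which the custom NCCL kernels perform).

```python
import numpy as np

# Illustrative top-k expert routing (the production path uses fused kernels
# and custom NCCL kernels for the dispatch/combine all-to-all).
num_tokens, hidden, num_experts, topk = 16, 64, 256, 8

x = np.random.randn(num_tokens, hidden).astype(np.float32)
gate_w = np.random.randn(hidden, num_experts).astype(np.float32)

logits = x @ gate_w                                   # router scores per (token, expert)
top_experts = np.argsort(-logits, axis=1)[:, :topk]   # top-k expert ids per token

# Per-expert token counts drive the variable-size communication: each EP rank
# only learns at runtime how many tokens it will receive for its local experts.
counts = np.bincount(top_experts.ravel(), minlength=num_experts)
print("busiest expert receives", counts.max(), "tokens; quietest receives", counts.min())
```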

Load balancing wide experts

Load balancing is a classic distributed systems technique that assigns work based on resource availability to maximize utilization without overloading any single part of the system. In the case of large-scale EP workloads, load balancing is used to distribute experts among the available GPUs. For example, in a GB200 NVL72 rack running Wide-EP DeepSeek-R1 with EP=64 (for clean division), we would distribute four experts per GPU per layer, for a total of 232 experts assigned per GPU.

To prevent scenarios where a set of very popular "hot experts" all sit on the same GPU while GPUs holding less popular "cold experts" sit idle, Wide-EP's Expert Parallel Load Balancer (EPLB) uses a policy to redistribute hot experts alongside cold experts. This triggers a weight-update process, handled by a containerized design that allows experts to flow in and out of container allocations without breaking the CUDA graph. These weight updates are performed in a non-blocking fashion by scheduling them between forward passes.

Figure 4. Diagram showing how the Expert Parallel Load Balancer (EPLB) redistributes experts to ensure balanced GPU workloads, preventing over- and under-utilization.

The EPLB can operate in two different modes: 

  • Static EPLB: Pre-computed expert-to-GPU mappings, based on historical traffic patterns, are used to optimize expert allocation.
  • Online EPLB: Experts are redistributed dynamically at runtime to adapt in real time to changing workload patterns.

While static EPLB offers good baseline improvements over a non-EPLB approach, online EPLB provides the greatest potential for optimal load balancing in real-time production systems. In our initial implementation of online EPLB, we encountered and patched several critical challenges related to the real-time weight-updating process.
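As a rough illustration of what such a rebalancing policy does, here is a minimal greedy sketch. It is not the actual EPLB algorithm, which works within fixed per-GPU expert container slots and a live weight-update path; the sketch simply places the hottest experts first, each onto the currently least-loaded GPU.

```python
import heapq

# Minimal greedy sketch of expert rebalancing (illustrative only, not the EPLB policy).
def rebalance(expert_load: dict[int, float], num_gpus: int) -> dict[int, list[int]]:
    """Assign experts to GPUs so accumulated token load stays roughly even."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]       # (accumulated_load, gpu_id)
    heapq.heapify(heap)
    placement: dict[int, list[int]] = {gpu: [] for gpu in range(num_gpus)}
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        acc, gpu = heapq.heappop(heap)                   # currently least-loaded GPU
        placement[gpu].append(expert)
        heapq.heappush(heap, (acc + load, gpu))
    return placement

# Toy example: 8 experts with skewed popularity, balanced over 2 GPUs.
loads = {0: 0.40, 1: 0.25, 2: 0.10, 3: 0.08, 4: 0.07, 5: 0.05, 6: 0.03, 7: 0.02}
print(rebalance(loads, num_gpus=2))
```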

When deploying MoE models like DeepSeek-R1 or Llama 4 at scale, inference performance hinges on two key pillars: disaggregated serving and Wide-EP. NVIDIA Dynamo and TensorRT-LLM form the software backbone that enables both, transforming traditional bottlenecks into opportunities for major throughput gains and efficient GPU utilization. The table below outlines the differences and synergies between Dynamo and Wide-EP.

| Component | NVIDIA Dynamo | TensorRT-LLM Wide-EP |
|---|---|---|
| Role | Orchestration layer for disaggregated inference | Execution engine for expert-parallel decoding |
| Optimization scope | Orchestrates prefill and decode phases across GPU pools | Distributes a small number of experts per GPU to optimize per-token memory and compute utilization |
| SLA awareness | SLA-aware autoscaling and dynamic rate matching (TTFT and ITL) | Maximizes batching and minimizes latency through efficient expert scheduling |
| Traffic adaptation | Reacts in real time to ISL/OSL fluctuations via the Dynamo Planner | Load balances expert allocations to optimize compute utilization |
| Hardware synergy | Scales via Kubernetes and Planner logic across disaggregated GPU domains | Leverages high-bandwidth domains (e.g., NVL72) for efficient expert communication |
Table 1. Comparison of NVIDIA Dynamo and TensorRT-LLM Wide-EP for expert-parallel inference, highlighting roles, optimization scope, SLA awareness, traffic adaptation, and hardware synergy.

For more insights into the relationships between NVIDIA Dynamo and TensorRT-LLM Wide-EP, we encourage you to review our blog on leveraging NVIDIA Dynamo for large-scale expert parallelism. 

When you have access to the coherent memory domain created by NVLink scale-up in a GB200 NVL72 rack, optimizing large-scale EP comes down to a few critical factors (a rough rule-of-thumb sketch follows the list):

  • Model size and number of experts: Smaller models with fewer experts gain less from Wide-EP because communication overhead can outweigh the benefits of reduced weight loading and distributed compute.
  • System latency and concurrency goals: Large-scale EP is most effective when throughput is constrained by latency, allowing for greater per-GPU throughput at iso-latency.
  • Hardware capabilities: Aggregate memory bandwidth, inter-GPU bandwidth, and achievable compute determine whether the system can reach the optimal degree of parallelism.
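To summarize these factors, the sketch below encodes them as a hypothetical rule-of-thumb check. The function name and numeric thresholds are illustrative assumptions, not NVIDIA guidance; the qualitative point is that large-scale EP pays off for large, many-expert models served under latency constraints on a high-bandwidth domain.

```python
# Hypothetical rule-of-thumb check (illustrative assumptions, not NVIDIA guidance):
# is a model/deployment a reasonable candidate for large-scale EP?
def wide_ep_candidate(num_experts: int, total_params_b: float,
                      latency_bound: bool, high_bw_domain: bool) -> bool:
    """Very rough heuristic distilled from the factors listed above."""
    return (
        num_experts >= 64          # enough experts to spread across 8+ GPUs (assumed cutoff)
        and total_params_b >= 100  # large enough that weight loading dominates (assumed cutoff)
        and latency_bound          # throughput is limited by latency targets
        and high_bw_domain         # e.g., an NVL72-class NVLink domain
    )

print(wide_ep_candidate(256, 671, True, True))   # DeepSeek-R1-like case -> True
```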

In practice, models like DeepSeek-R1 are strong candidates for large-scale EP, where TensorRT-LLM's Wide-EP on GB200 NVL72 rack-scale systems delivers the best balance of efficiency and throughput. The Pareto frontiers below highlight performance across different EP configurations.

Figure 5. Large-scale Expert Parallelism (EP) at rank 32 delivers up to 1.8x higher output token throughput per GPU compared to small EP rank 8 at 100 tokens/sec per user. Both configurations leverage disaggregated serving and multi-token prediction (MTP).

Compared to the small EP configuration (EP8), the large EP configuration (EP32) achieves up to 1.8x more per-GPU throughput. This highlights the performance uplift available from leveraging large-scale EP and Wide-EP. A further opportunity exists to leverage speculative decoding with multi-token prediction (MTP) to boost per-user token throughput; this functionality is already compatible with Wide-EP.

Wide-EP on GB200 NVL72 provides a practical path to scaling large MoE models. Distributing experts across more GPUs reduces weight-loading pressure, improves GroupGEMM efficiency, and leverages GB200 NVL72's 130 TB/s coherent NVLink domain to offset communication overhead. In testing, large EP configurations reached up to 1.8x higher per-GPU throughput than smaller EP setups. These gains shift the balance of throughput, latency, and utilization in favor of more efficient large-scale inference.

The broader impact is on system economics. By enabling higher concurrency and stronger GPU efficiency, Wide-EP on NVL72 improves tokens/second/GPU and lowers the overall cost of serving large models. For developers, this means exploring Wide-EP in TensorRT-LLM to find optimal configurations. For researchers, it creates room to refine scheduling, load balancing, and decoding strategies. For infrastructure teams, it highlights how GB200 NVL72 can change the TCO profile of trillion-parameter deployments.

For more, check out how large-scale EP with GB200 NVL72 delivered the lowest TCO of any system architecture in the latest InferenceMAX benchmarks.

And for up-to-date performance insights, check out the NVIDIA Inference Performance dashboard.


