Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems

Modern AI workloads have moved well beyond single-GPU inference serving. Model parallelism, which efficiently splits computation across many GPUs, is now the foundation of scalable, state-of-the-art deployments. The best-performing models increasingly adopt mixture-of-experts (MoE) architectures, which are more efficient than dense models because they activate only a subset of trained parameters per token. However, scaling MoEs introduces more complex parallelism, communication, and scheduling requirements that must be rigorously optimized.

Expert parallelism (EP), the strategic distribution of experts across multiple GPUs, is key to overcoming these challenges and unlocking scalable performance. As models like DeepSeek-R1, with 256 experts and 671 billion parameters, continue to grow, new tools are needed, such as NVIDIA TensorRT-LLM's Wide Expert Parallelism (Wide-EP). It makes large-scale deployment more efficient, improving both performance and total cost of ownership.

In this blog, we break down how large-scale EP impacts performance and reshapes inference economics within the NVL72 rack-scale domain.

Expert parallelism (EP) is a model-parallel technique that distributes a MoE model's experts across multiple GPUs to take advantage of their combined compute and memory bandwidth. At smaller scales, EP helps reduce memory pressure and keep utilization high by balancing work across devices.

Figure 1. Animation showing how small-scale EP deploys many experts per GPU, while large-scale EP spreads fewer experts per GPU across a much larger cluster, enabling efficient scaling of MoE layers.

As models like DeepSeek-R1 grow to hundreds of billions of parameters with hundreds of experts, these same techniques must expand in scope, resulting in what we call large-scale EP. For the purposes of this blog, large-scale EP refers to the practice of distributing experts across eight or more GPUs. This increases aggregate bandwidth for faster weight loading and supports larger effective batch sizes to improve overall GPU utilization.
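To make the scaling intuition concrete, here is a small Python sketch (a hypothetical helper, not TensorRT-LLM code) showing how many experts each GPU holds per MoE layer when a DeepSeek-R1-like layer of 256 experts is spread across different EP degrees.

```python
# Hypothetical sketch: experts held by each GPU per MoE layer as the
# expert-parallel (EP) degree grows. Not TensorRT-LLM code.

NUM_EXPERTS = 256  # experts per MoE layer (DeepSeek-R1-like)

def experts_per_gpu(ep_size: int) -> int:
    """Experts held by each GPU for one layer, assuming even division."""
    assert NUM_EXPERTS % ep_size == 0, "choose an EP size that divides the expert count"
    return NUM_EXPERTS // ep_size

for ep in (4, 8, 32, 64):
    print(f"EP={ep:>2}: {experts_per_gpu(ep)} experts per GPU per layer")

# EP= 4: 64 experts per GPU per layer  (small-scale EP: many experts packed per GPU)
# EP=64:  4 experts per GPU per layer  (large-scale EP: few experts, more aggregate bandwidth)
```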

What are the memory and compute challenges of large-scale EP?

MoE models offer the additional benefit of activating only a small subset of experts during inference, significantly reducing the per-token compute requirement. To realize this, MoEs dynamically load the weights of an activated expert on a per-token, per-layer basis. In high-throughput, latency-constrained scenarios, this weight-loading overhead can quickly become a serious bottleneck for a specific type of compute operation called the MoE GroupGEMM.

MoE GroupGEMMs are like sending all tokens to the same checkout lane at the same time, so they can be processed in a single efficient batch. In practice, they are grouped matrix multiplications that batch tokens per expert into a single large calculation. This improves arithmetic intensity, but it requires loading each expert's weights into on-chip memory/registers before multiplication.
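To illustrate the grouping idea, below is a minimal NumPy sketch of what a GroupGEMM conceptually does: tokens routed to the same expert are gathered into one matrix and multiplied against that expert's weights once, so the weights are loaded once per group rather than once per token. The shapes and the per-expert loop are illustrative only; the real fused kernel executes all expert blocks in a single launch.

```python
import numpy as np

# Minimal sketch of the idea behind an MoE GroupGEMM (illustrative, not the fused kernel).
hidden, ffn = 128, 512          # toy dimensions
num_experts, num_tokens = 4, 32

tokens = np.random.randn(num_tokens, hidden).astype(np.float32)
expert_ids = np.random.randint(0, num_experts, size=num_tokens)   # router output
weights = [np.random.randn(hidden, ffn).astype(np.float32) for _ in range(num_experts)]

outputs = np.empty((num_tokens, ffn), dtype=np.float32)
for e in range(num_experts):
    idx = np.where(expert_ids == e)[0]    # gather all tokens routed to expert e
    if idx.size == 0:
        continue
    # One batched matmul per expert: the expert's weights are loaded once
    # and reused across every token in the group.
    outputs[idx] = tokens[idx] @ weights[e]
```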

Figure 2. Tokens routed to the same expert are packed together and processed with a single fused GroupGEMM kernel for efficient MoE inference.

Large-scale EP addresses some of the MoE GroupGEMM bottlenecks by introducing more GPUs into the expert-parallel configuration, effectively reducing the number of experts held by each GPU. This leads to:

  • Less weight-loading pressure (smaller set of expert weights per GPU)
  • Easier reuse of weights by the GroupGEMM kernel (higher arithmetic intensity: more FLOPs per byte of weight loaded; a rough estimate follows this list)
  • Better compute/memory balance inside the kernel
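The arithmetic-intensity point can be made with rough numbers. The sketch below uses toy figures (FP8 weights assumed, activation traffic ignored): FLOPs per byte of expert weight loaded grows linearly with the number of tokens packed onto each expert, which is what larger effective batch sizes per expert provide.

```python
# Rough arithmetic-intensity estimate for one expert GEMM
# (toy figures; FP8 weights assumed, activation traffic ignored).
hidden, ffn = 7168, 2048        # illustrative expert FFN shape
bytes_per_param = 1             # FP8

def flops_per_weight_byte(tokens_per_expert: int) -> float:
    flops = 2 * tokens_per_expert * hidden * ffn       # one grouped matmul
    weight_bytes = hidden * ffn * bytes_per_param      # weights loaded once per group
    return flops / weight_bytes

for t in (4, 32, 256):
    print(f"{t:>3} tokens/expert -> {flops_per_weight_byte(t):6.0f} FLOPs per weight byte")
# More tokens per expert means each loaded weight byte does more useful work.
```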

While large-scale EP helps address the limitations of small-scale EP, it also introduces new system-level constraints that make scaling large MoEs difficult. TensorRT-LLM Wide-EP helps address these constraints by targeting compute and memory bottlenecks algorithmically while also tackling workload management at the system and architecture level.

Let's examine how Wide-EP, when paired with GB200 NVL72, provides the foundation for scalable and efficient MoE inference.

What is the system design and architecture?

Scaling expert parallelism requires more than adding GPUs. It depends on system design and architecture that keep memory movement and communication efficient. Interconnect bandwidth and topology provide the foundation, allowing activations and weights to flow smoothly across devices.

On top of this, optimized software and kernels manage expert-to-expert traffic with communication primitives, bandwidth-aware scheduling, and load balancing. Together, these capabilities make large-scale EP practical and efficient.

One of the biggest bottlenecks in large-scale EP is communication overhead. During the decode phase of inference, distributed experts must exchange information to consolidate the outputs of multiple GPUs across the system. For example, when distributing DeepSeek-R1's 256 experts across 64 GPUs with eight active experts per token (see Figure 3 below), the communication cost depends on which experts are activated at a given layer and where their weights are located.

Figure 3. Schematic diagram showing an MoE deployment with 232 experts per GPU and only 4 activated per layer, coordinated across 72 GPUs in a GB200 NVL72 NVLink domain.

While large-scale EP reduces weight-loading overhead for activated experts, these gains can be offset by token-gather collectives that must consolidate distributed outputs and reorder tokens before passing them to the next transformer block or the final softmax layer. Without the 130 TB/s of aggregate bandwidth provided by the NVL72, the complexity and overhead of this communication pattern would make large-scale EP impractical.
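For a rough sense of scale, the back-of-the-envelope estimate below approximates the dispatch-and-combine traffic generated by a single decoded token. The numbers are illustrative assumptions (hidden size 7168, eight routed experts per token, FP8 activations, the 58 MoE layers of a DeepSeek-R1-style model), and protocol overhead and expert locality are ignored.

```python
# Back-of-the-envelope dispatch/combine traffic per decoded token
# (illustrative assumptions; ignores protocol overhead and expert locality).
hidden_size = 7168        # DeepSeek-R1-like hidden dimension
topk = 8                  # routed (active) experts per token
bytes_per_activation = 1  # FP8 activations assumed
moe_layers = 58           # MoE layers in a DeepSeek-R1-style model

dispatch_bytes = topk * hidden_size * bytes_per_activation   # send token to each routed expert
combine_bytes  = topk * hidden_size * bytes_per_activation   # gather expert outputs back
per_token = (dispatch_bytes + combine_bytes) * moe_layers

print(f"~{per_token / 1e6:.1f} MB moved across the EP domain per token per forward pass")
```

Multiplied by thousands of concurrent users, this is the traffic pattern that the NVL72's aggregate NVLink bandwidth is designed to absorb.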

Optimizing kernels for expert routing with NCCL

MoEs leverage a routing mechanism to dynamically select the most appropriate experts for each token. This means every transformer block requires per-token dispatch and aggregation as tokens pass through the expert layers. The all-to-all operations involved can quickly saturate an already memory-bound decode phase.

To address these challenges, custom EP communication kernels are required. For GB200 NVL72, we have implemented custom kernels to address CUDA graph compatibility across multiple rack-scale deployment scenarios. Of note are custom high-performance NCCL kernels designed to handle non-static data sizes across large-scale EP deployments. These custom EP kernels are able to accept communication sizes directly from GPU memory and take advantage of the NVL72 aggregate memory.
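Conceptually, the routing step is a small gating matmul followed by a top-k selection per token, and the resulting per-expert token counts are exactly the non-static data sizes those kernels must handle. The NumPy sketch below shows the shape of that computation (illustrative only; it omits the actual all-to-all dispatch and combine, which the custom NCCL kernels perform).

```python
import numpy as np

# Illustrative top-k expert routing (the production path uses fused kernels
# and custom NCCL kernels for the dispatch/combine all-to-all).
num_tokens, hidden, num_experts, topk = 16, 64, 256, 8

x = np.random.randn(num_tokens, hidden).astype(np.float32)
gate_w = np.random.randn(hidden, num_experts).astype(np.float32)

logits = x @ gate_w                                   # router scores per (token, expert)
top_experts = np.argsort(-logits, axis=1)[:, :topk]   # top-k expert ids per token

# Per-expert token counts drive the variable-size communication: each EP rank
# only learns at runtime how many tokens it will receive for its local experts.
counts = np.bincount(top_experts.ravel(), minlength=num_experts)
print("busiest expert receives", counts.max(), "tokens; quietest receives", counts.min())
```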

Load balancing wide experts

Load balancing is a classic distributed systems technique that assigns work based on resource availability to maximize utilization without overloading any single part of the system. In the case of large-scale EP workloads, load balancing is used to distribute experts among the available GPUs. For example, in a GB200 NVL72 rack running Wide-EP DeepSeek-R1 with EP=64 (for clean division), we would distribute four experts per GPU per layer, for a total of 232 experts assigned per GPU.

To prevent scenarios where a set of very popular "hot experts" all sit on the same GPU while GPUs holding less popular "cold experts" sit idle, Wide-EP's Expert Parallel Load Balancer (EPLB) uses a policy to redistribute hot experts alongside cold experts. This triggers a weight-update process, handled by a containerized design that allows experts to flow in and out of container allocations without breaking the CUDA graph. These weight updates are performed in a non-blocking fashion by scheduling them between forward passes.

Figure 4. Diagram showing how the Expert Parallel Load Balancer (EPLB) redistributes experts to ensure balanced GPU workloads, preventing over- and under-utilization.

The EPLB can operate in two different modes: 

  • Static EPLB: Pre-computed expert-to-GPU mappings, based on historical traffic patterns, are used to optimize expert allocation.
  • Online EPLB: Experts are redistributed dynamically at runtime to adapt in real time to changing workload patterns.

While static EPLB offers good baseline improvements over a non-EPLB approach, online EPLB provides the greatest potential for optimal load balancing in real-time production systems. In our initial implementation of online EPLB, we encountered and patched several critical challenges related to the real-time weight-updating process.
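As a rough illustration of what such a rebalancing policy does, here is a minimal greedy sketch. It is not the actual EPLB algorithm, which works within fixed per-GPU expert container slots and a live weight-update path; the sketch simply places the hottest experts first, each onto the currently least-loaded GPU.

```python
import heapq

# Minimal greedy sketch of expert rebalancing (illustrative only, not the EPLB policy).
def rebalance(expert_load: dict[int, float], num_gpus: int) -> dict[int, list[int]]:
    """Assign experts to GPUs so accumulated token load stays roughly even."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]       # (accumulated_load, gpu_id)
    heapq.heapify(heap)
    placement: dict[int, list[int]] = {gpu: [] for gpu in range(num_gpus)}
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        acc, gpu = heapq.heappop(heap)                   # currently least-loaded GPU
        placement[gpu].append(expert)
        heapq.heappush(heap, (acc + load, gpu))
    return placement

# Toy example: 8 experts with skewed popularity, balanced over 2 GPUs.
loads = {0: 0.40, 1: 0.25, 2: 0.10, 3: 0.08, 4: 0.07, 5: 0.05, 6: 0.03, 7: 0.02}
print(rebalance(loads, num_gpus=2))
```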

When deploying MoE models like DeepSeek-R1 or Llama 4 at scale, inference performance hinges on two key pillars: disaggregated serving and Wide-EP. NVIDIA Dynamo and TensorRT-LLM form the software backbone that enables both, transforming traditional bottlenecks into opportunities for major throughput gains and efficient GPU utilization. The table below outlines the differences and synergies between Dynamo and Wide-EP.

| Component | NVIDIA Dynamo | TensorRT-LLM Wide-EP |
|---|---|---|
| Role | Orchestration layer for disaggregated inference | Execution engine for expert-parallel decoding |
| Optimization scope | Orchestrates prefill and decode phases across GPU pools | Distributes a small number of experts per GPU to optimize per-token memory and compute utilization |
| SLA awareness | SLA-aware autoscaling and dynamic rate matching (TTFT and ITL) | Maximizes batching and minimizes latency through efficient expert scheduling |
| Traffic adaptation | Reacts in real time to ISL/OSL fluctuations via the Dynamo Planner | Load balances expert allocations to optimize compute utilization |
| Hardware synergy | Scales via Kubernetes and Planner logic across disaggregated GPU domains | Leverages high-bandwidth domains (e.g., NVL72) for efficient expert communication |
Table 1. Comparison of NVIDIA Dynamo and TensorRT-LLM Wide-EP for expert-parallel inference, highlighting roles, optimization scope, SLA awareness, traffic adaptation, and hardware synergy.

For more insights into the relationships between NVIDIA Dynamo and TensorRT-LLM Wide-EP, we encourage you to review our blog on leveraging NVIDIA Dynamo for large-scale expert parallelism. 

When you have access to the coherent memory domain created by NVLink scale-up in a GB200 NVL72 rack, optimizing large-scale EP comes down to a few critical factors (a rough rule-of-thumb sketch follows the list):

  • Model size and number of experts: Smaller models with fewer experts gain less from Wide-EP because communication overhead can outweigh the benefits of reduced weight loading and distributed compute.
  • System latency and concurrency goals: Large-scale EP is most effective when throughput is constrained by latency, allowing for greater per-GPU throughput at iso-latency.
  • Hardware capabilities: Aggregate memory bandwidth, inter-GPU bandwidth, and achievable compute determine whether the system can reach the optimal degree of parallelism.
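To summarize these factors, the sketch below encodes them as a hypothetical rule-of-thumb check. The function name and numeric thresholds are illustrative assumptions, not NVIDIA guidance; the qualitative point is that large-scale EP pays off for large, many-expert models served under latency constraints on a high-bandwidth domain.

```python
# Hypothetical rule-of-thumb check (illustrative assumptions, not NVIDIA guidance):
# is a model/deployment a reasonable candidate for large-scale EP?
def wide_ep_candidate(num_experts: int, total_params_b: float,
                      latency_bound: bool, high_bw_domain: bool) -> bool:
    """Very rough heuristic distilled from the factors listed above."""
    return (
        num_experts >= 64          # enough experts to spread across 8+ GPUs (assumed cutoff)
        and total_params_b >= 100  # large enough that weight loading dominates (assumed cutoff)
        and latency_bound          # throughput is limited by latency targets
        and high_bw_domain         # e.g., an NVL72-class NVLink domain
    )

print(wide_ep_candidate(256, 671, True, True))   # DeepSeek-R1-like case -> True
```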

In practice, models like DeepSeek-R1 are strong candidates for large-scale EP, where TensorRT-LLM's Wide-EP on GB200 NVL72 rack-scale systems delivers the best balance of efficiency and throughput. The Pareto frontiers below highlight performance across different EP configurations.

Figure 5. Large-scale Expert Parallelism (EP) at rank 32 delivers up to 1.8x higher output token throughput per GPU compared to small EP rank 8 at 100 tokens/sec per user. Both configurations leverage disaggregated serving and multi-token prediction (MTP).

Compared to the small EP configuration (EP8), the large EP configuration (EP32) achieves up to 1.8x more per-GPU throughput. This highlights the performance uplift available from leveraging large-scale EP and Wide-EP. A further opportunity exists to leverage speculative decoding with multi-token prediction (MTP) to boost per-user token throughput; this functionality is already compatible with Wide-EP.

Wide-EP on GB200 NVL72 provides a practical path to scaling large MoE models. Distributing experts across more GPUs reduces weight-loading pressure, improves GroupGEMM efficiency, and leverages GB200 NVL72's 130 TB/s coherent NVLink domain to offset communication overhead. In testing, large EP configurations reached up to 1.8x higher per-GPU throughput than smaller EP setups. These gains shift the balance of throughput, latency, and utilization in favor of more efficient large-scale inference.

The broader impact is on system economics. By enabling higher concurrency and stronger GPU efficiency, Wide-EP on NVL72 improves tokens/second/GPU and lowers the overall cost of serving large models. For developers, this means exploring Wide-EP in TensorRT-LLM to find optimal configurations. For researchers, it creates room to refine scheduling, load balancing, and decoding strategies. For infrastructure teams, it highlights how GB200 NVL72 can change the TCO profile of trillion-parameter deployments.

For more, check out how large-scale EP with GB200 NVL72 delivered the lowest TCO of any system architecture in the latest InferenceMAX benchmarks.

And for up-to-date performance insights, check out the NVIDIA Inference Performance dashboard.


