In LLM training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is difficult. EP communication is logically an all-to-all exchange, but because of its dynamic and sparse nature (each token activates only its top-k experts instead of all experts), it is difficult to implement and optimize.
This post details an efficient MoE EP communication solution, Hybrid-EP, and its use within the NVIDIA Megatron family of frameworks, on NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet platforms. It also dives into the effectiveness of Hybrid-EP in real-world model training.
Efficiency challenges of hyperscale MoE model training
DeepSeek-V3 is representative of the new generation of large-scale, fine-grained MoE models. Such models balance computational overhead against model quality through large parameter scale with sparse activation, but they also pose serious challenges for existing large-model training frameworks.
- Communication efficiency bottlenecks: MoE models rely on expert parallelism and require frequent all-to-all communication. As the number of experts increases, the burden of EP communication grows. In DeepSeek-V3, communication time can account for more than 50% of overall training time without optimization.
- Load imbalance: Dynamic routing causes some "hot" experts to receive more tokens than average while "cold" experts are underutilized, leading to uneven compute load across devices and wasted computing power. This problem becomes more pronounced in fine-grained scenarios where the number of experts and the number of activated experts continue to increase.
- Framework adaptability challenges: Today's MoE models impose higher and more complex requirements on parallel strategies, low-precision computing, and dynamic resource scheduling. They also need optimization to maximize the potential of next-generation hardware architectures such as NVIDIA Blackwell, NVIDIA Quantum InfiniBand, and NVIDIA Spectrum-X Ethernet.
MoE training framework optimization and communication solution
NVIDIA Megatron Core—an open source library for large-scale model training—is a key foundation for training hyperscale MoE models. Its core advantages include:
- Multidimensional parallelism strategies, with support for tensor parallelism (TP), sequence parallelism, pipeline parallelism (PP), MoE expert parallelism (EP), and other strategies that can be flexibly combined to accommodate diverse and complex training workloads.
- Resource and efficiency optimization, integrating FP8 mixed-precision training, activation offloading, a distributed optimizer, and fine-grained recomputation to reduce GPU memory consumption and fully support model training. It integrates multiple efficient operators (such as MLA, Attention, and MLP) and provides various fusion optimizations and pipeline scheduling strategies to improve computing performance.
- MoE-specific adaptation, providing complete support for mainstream MoE models such as DeepSeek, Mixtral, and Qwen, with efficient, scalable training.
How Hybrid-EP delivers efficient communication optimization
Hybrid-EP is a newly designed MoE EP communication library. It uses hardware and software advancements on the NVIDIA platform to achieve near-hardware-limit communication bandwidth and minimize GPU hardware resource usage in RDMA-NVLink hybrid network architectures.
It implements the two core operators of MoE EP communication: dispatch, which routes the tokens output by the attention operator to the corresponding experts, and combine, which routes the tokens output by the experts back to the attention operator. Routing and data-processing support is also included to enable complete EP communication.
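As a reference for these semantics only (not Hybrid-EP's actual CUDA kernels), dispatch and combine can be sketched in plain Python; the token values, routing map, and expert count below are illustrative.

```python
# Reference semantics of MoE dispatch/combine (illustrative sketch only).
# Each token is routed to the experts its routing-map row selects; combine
# accumulates the expert outputs back to the owning token.

def dispatch(tokens, routing_map, num_experts):
    """Group each token under every expert its routing row selects."""
    per_expert = [[] for _ in range(num_experts)]
    for tok_id, row in enumerate(routing_map):
        for expert_id, selected in enumerate(row):
            if selected:
                per_expert[expert_id].append((tok_id, tokens[tok_id]))
    return per_expert

def combine(expert_outputs, num_tokens):
    """Accumulate expert outputs back to the owning token."""
    out = [0.0] * num_tokens
    for outputs in expert_outputs:
        for tok_id, value in outputs:
            out[tok_id] += value
    return out

tokens = [1.0, 2.0, 3.0]
routing_map = [          # 3 tokens x 2 experts
    [True, False],
    [True, True],
    [False, True],
]
per_expert = dispatch(tokens, routing_map, num_experts=2)
# Treating each expert as the identity function for this sketch:
combined = combine(per_expert, num_tokens=3)
print(per_expert)  # [[(0, 1.0), (1, 2.0)], [(1, 2.0), (2, 3.0)]]
print(combined)    # [1.0, 4.0, 3.0]
```

Hybrid-EP implements this same logical routing, but with chunked pipelines over NVLink and RDMA rather than per-token loops.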


The design goals and core optimization directions of Hybrid-EP include:
- Leveraging the latest communication technologies on the NVIDIA platform, such as TMA instructions for data movement on NVLink scale-up networks and low-level IBGDA technology for RDMA networks.
- RDMA and NVLink hybrid network communication, which maximizes cross-domain bandwidth by combining intra-node NVLink with inter-node RDMA to improve algorithmic bandwidth.
- A data pipeline that masks most of the latency of communication and dynamic routing by cutting data into fine-grained chunks and streaming them through multiple levels of the communication pipeline, making EP bandwidth comparable to a highly optimized, static all-to-all.
- Minimized GPU streaming multiprocessor (SM) usage to maximize communication-computation overlap. Hybrid-EP achieves peak communication bandwidth with fewer SMs, leaving more SMs available for computation.
- Native low-precision support, with FP8/BF16 dispatch operators and BF16 combine operators.
Hybrid-EP designs each CUDA block as an independent data channel that occupies one SM and runs a complete data pipeline. Different warp groups within a CUDA block handle different pipeline stages. CUDA blocks run in parallel and process different data chunks with no synchronization or communication between blocks.
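The scheduling idea can be illustrated with a toy cycle-by-cycle simulation of a two-stage pipeline (a G2S stage feeding an S2G stage): once the pipeline fills, one chunk completes per step, so most transfer latency is hidden. This is purely a sketch of stage overlap, not the CUDA implementation.

```python
# Toy simulation of a two-stage pipeline. At each step the G2S stage loads
# chunk i while the S2G stage stores chunk i-1, so the stages overlap.

def simulate(chunks):
    schedule = []  # (step, stage, chunk)
    for step in range(len(chunks) + 1):
        if step < len(chunks):
            schedule.append((step, "G2S", chunks[step]))       # load next chunk
        if step >= 1:
            schedule.append((step, "S2G", chunks[step - 1]))   # store previous chunk
    return schedule

for row in simulate(["c0", "c1", "c2"]):
    print(row)
# (0, 'G2S', 'c0')
# (1, 'G2S', 'c1')
# (1, 'S2G', 'c0')
# ... and so on: after step 0, both stages are busy every step.
```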


The dotted boxes in Figure 2 represent pipeline stages used only for RDMA network communication. The RDMA warp group is responsible for submitting network traffic to RDMA network interface cards (NICs) using IBGDA technology, completing network communication and token data transmission between same-rail GPUs on different nodes (such as GPU 0 on every node).
The G2S warp group is responsible for reading the token data owned by the local GPU, along with the token data transmitted by same-rail GPUs on other nodes, into the shared memory first-in, first-out (FIFO) queue inside the SM. The S2G warp group writes the token data from the shared memory FIFO queue inside the SM to the corresponding location in the output buffer of every GPU in the node (including the local GPU).
During this process, tokens are routed and transported according to the information in the routing map, avoiding the transmission of unneeded token data. Each CUDA block uses this data pipeline to process the token data in its assigned data chunks in order. Different CUDA blocks handle different data chunks using the same data pipeline.
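How chunks might be divided among blocks can be sketched as follows; the round-robin assignment and chunk size are illustrative assumptions, not Hybrid-EP's actual scheduling policy.

```python
# Illustrative chunk-to-block assignment: each "block" gets its own set of
# token ranges and processes them independently with the same pipeline.

def assign_chunks(num_tokens, chunk_tokens, num_blocks):
    """Round-robin assignment: block b handles chunks b, b+B, b+2B, ..."""
    num_chunks = (num_tokens + chunk_tokens - 1) // chunk_tokens
    assignment = {b: [] for b in range(num_blocks)}
    for chunk_id in range(num_chunks):
        start = chunk_id * chunk_tokens
        end = min(start + chunk_tokens, num_tokens)
        assignment[chunk_id % num_blocks].append((start, end))
    return assignment

print(assign_chunks(num_tokens=10, chunk_tokens=3, num_blocks=2))
# {0: [(0, 3), (6, 9)], 1: [(3, 6), (9, 10)]}
```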


As with the dispatch operator, the dotted section is only used for RDMA network communication. Because the combine operator performs high-precision accumulation over tokens, which is currently done only on the CUDA cores inside the SM, these accumulations must be performed hierarchically.
In the multi-node case, the relevant intra-node warp groups first complete part of the accumulation for each token within the node, and then the RDMA warp group sends the partially accumulated token to the same-rail GPU on the destination node. Finally, the inter-node warp groups complete the global accumulation to produce the final result.
In the single-node case, the intra-node accumulation is performed directly by the warp groups that would otherwise handle the inter-node stage. During this process, the input token is read by the corresponding G2S warp group into the shared memory G2S FIFO queue inside the SM, and the corresponding reduction warp group then accumulates tokens on the CUDA cores. The result is stored in the shared memory S2G FIFO queue inside the SM and handed to the TMA unit for writing to GPU memory.
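The two-stage accumulation can be summarized numerically; scalar values stand in for token tensors, and the node and GPU counts are illustrative.

```python
# Hierarchical accumulation sketch for the combine operator: ranks inside a
# node first reduce their contributions (NVLink domain), then one partial
# result per node is reduced across nodes (RDMA, same rail).

def combine_hierarchical(contributions_per_node):
    """contributions_per_node: list of nodes, each a list of per-GPU values."""
    # Stage 1: intra-node partial accumulation over NVLink.
    partials = [sum(gpu_vals) for gpu_vals in contributions_per_node]
    # Stage 2: inter-node accumulation of the partial sums over RDMA.
    return sum(partials)

# 2 nodes x 4 GPUs, each contributing a partial expert output for one token:
contributions = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.0, 1.0]]
print(combine_hierarchical(contributions))  # 13.0
```

Doing the intra-node reduction first means only one partially accumulated token per node crosses the RDMA network, instead of one per GPU.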
Hybrid-EP was tested across multiple hardware platforms with the following test conditions:
- HIDDEN_DIM is 8,192
- DATA_TYPE is BF16; only token data is transferred.
- NUM_OF_ATTN_TOKENS_PER_RANK is 4,096. NUM_OF_EXPERTS_PER_RANK is 2.
- The routing map is generated randomly from a uniform distribution.
- TOPK is 8.
- Use the NVIDIA Quantum InfiniBand network.
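Under these conditions, the logical dispatch payload per rank can be estimated as follows (a rough sizing sketch that assumes each top-k copy is transferred separately, with no deduplication of copies destined for the same rank):

```python
# Per-rank dispatch payload implied by the benchmark configuration: each rank
# dispatches 4,096 tokens of 8,192 BF16 values to its top-8 experts.

HIDDEN_DIM = 8192
TOKENS_PER_RANK = 4096
TOPK = 8
BYTES_PER_ELEM = 2  # BF16

bytes_per_rank = TOKENS_PER_RANK * TOPK * HIDDEN_DIM * BYTES_PER_ELEM
print(f"{bytes_per_rank / 2**30:.1f} GiB per rank per dispatch")  # 0.5 GiB per rank per dispatch
```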
The first test was run on an NVIDIA DGX Hopper platform with eight H100 GPUs. Hybrid-EP fills NVLink bandwidth with only eight SMs.


Next, a cluster of four NVIDIA DGX Hopper systems, 8×4 = 32 GPUs in total, was tested. The four DGX H100 GPUs on the same rail each used an NVIDIA ConnectX-7 NIC at 400 Gbps, connected via the NVIDIA Quantum InfiniBand network.
Because Hybrid-EP performs hierarchical communication over the NVLink-RDMA hybrid network, two sets of data were collected during the test:
- NIC bus bandwidth: The actual speed achieved on the ConnectX-7 NIC, calculated from the amount of data passing through it and the total communication time.
- Algorithm bandwidth: The global bandwidth, calculated at the algorithm level, which measures the bandwidth that the dispatch and combine operators achieve across the hybrid network.
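The difference between the two metrics can be made concrete with a small calculation; the byte counts and timing below are made-up illustrative numbers.

```python
# NIC bus bandwidth counts only bytes that actually cross the NIC, while
# algorithm bandwidth divides the full logical payload of dispatch/combine
# (including the intra-node NVLink portion) by the same wall-clock time.

def bus_bandwidth_gbps(nic_bytes, seconds):
    return nic_bytes * 8 / seconds / 1e9

def algo_bandwidth_gbps(logical_bytes, seconds):
    return logical_bytes * 8 / seconds / 1e9

seconds = 0.010
nic_bytes = 0.4e9       # bytes that crossed the NIC (inter-node portion only)
logical_bytes = 1.6e9   # full logical payload, incl. intra-node NVLink traffic
print(f"{bus_bandwidth_gbps(nic_bytes, seconds):.0f} Gbps")   # 320 Gbps
print(f"{algo_bandwidth_gbps(logical_bytes, seconds):.0f} Gbps")  # 1280 Gbps
```

With hierarchical communication, algorithm bandwidth can thus exceed the per-NIC line rate, which is exactly what the hybrid design aims for.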
Hybrid-EP requires only about 4 SMs to approach the NIC’s maximum bandwidth.


Finally, Hybrid-EP performance was tested in a large-scale NVLink network on the NVIDIA Grace Blackwell platform. The NVLink domain size was 36 GPUs, which is a GB200 NVL36 system. Hybrid-EP requires only 16 SMs to fill NVLink bandwidth.


Practical cases: Real-world verification with popular models and hardware
Hybrid-EP is based on templates and CUDA C implementations that take both input and output buffer addresses. Some additional integration work is required to use Hybrid-EP in the PyTorch-based Megatron Core framework. It is now available in the DeepEP/Hybrid-EP branch and provides directly callable PyTorch operators, making it convenient for users to quickly complete integration and testing.
Because the Hybrid-EP kernel only accepts pointer parameters and is not responsible for memory management, a reasonable buffer management and allocation mechanism must be designed. Depending on the usage scenario, Hybrid-EP buffers can be roughly divided into two categories:
- Registered buffer: Specially registered GPU memory that can be accessed by kernels on other ranks. It is a globally unique static buffer. Registration depends on the scenario: cross-node communication registers the GPU memory with the communication memory region, while intra-node communication uses a driver-API handle that other ranks can resolve.
- Normal buffer: GPU memory allocated with cudaMalloc, which can be managed by PyTorch's allocator and is usually not globally unique.
Because buffer allocation and registration are time-consuming, they are ideally completed only during the Hybrid-EP initialization phase in Python. However, the MoE model is dynamic: the number of tokens received by the current rank varies each iteration, changing the required buffer size. To address this, a worst-case preallocation strategy is used, allocating a buffer large enough for the upper limit in which all tokens converge to the same rank. Because this buffer is globally unique, overall GPU memory usage remains controllable.
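A worst-case sizing sketch follows, using illustrative parameter names rather than Hybrid-EP's actual API. The bound assumes every rank routes all of its top-k token copies to this rank; an implementation that deduplicates copies destined for the same rank could divide out the top-k factor.

```python
# Worst-case preallocation sketch: size the registered buffer so that every
# token in the EP group could land on this rank. Names are illustrative.

def worst_case_buffer_bytes(ep_size, tokens_per_rank, topk, hidden_dim,
                            bytes_per_elem):
    # Upper bound: all ranks route all their top-k copies to this rank.
    max_tokens_here = ep_size * tokens_per_rank * topk
    return max_tokens_here * hidden_dim * bytes_per_elem

# EP=32, 4,096 tokens/rank, top-8, hidden 8,192, BF16:
size = worst_case_buffer_bytes(32, 4096, 8, 8192, 2)
print(f"{size / 2**30:.0f} GiB")  # 16 GiB
```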






In the PyTorch environment, Hybrid-EP's workflow is shown in Figure 10. After preprocessing, synchronization is required because Torch needs the GPU-side results to determine subsequent tensor sizes, while Hybrid-EP computes buffer sizes in the preprocessing kernel. This sync can be avoided if the host predefines a sufficiently large buffer size.
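The trade-off can be sketched as a host-side decision; the function and field names below are hypothetical, not part of the Hybrid-EP API.

```python
# Sketch of the synchronization trade-off: if the host preallocates a
# sufficiently large output buffer, tensor sizes are known host-side and the
# device-to-host sync after preprocessing can be skipped.

def plan_output_allocation(use_static_max_buffer, max_tokens):
    if use_static_max_buffer:
        # Host-side decision only: no D2H copy, no stream sync needed.
        return {"tokens": max_tokens, "sync_required": False}
    # Otherwise the exact per-iteration token count must come back from the
    # preprocessing kernel, forcing a device-to-host synchronization.
    return {"tokens": None, "sync_required": True}

print(plan_output_allocation(True, max_tokens=131072))
# {'tokens': 131072, 'sync_required': False}
```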


Optimization practices on Grace Blackwell
Megatron Core has integrated Hybrid-EP on the Grace Blackwell platform, and it can be optimized for different types of MoE models.
| Model | Precision | Dispatcher | TFLOPS/GPU | Speedup |
|---|---|---|---|---|
| DeepSeek-V3 | MXFP8 | DeepEP | 829 | 1x |
| DeepSeek-V3 | MXFP8 | Hybrid-EP | 943 | 1.14x |
| DeepSeek-V3 FSDP | MXFP8 | A2A | 597 | 1x |
| DeepSeek-V3 FSDP | MXFP8 | Hybrid-EP | 645 | 1.08x |
| Qwen 3 235B | BF16 | A2A | 665 | 1x |
| Qwen 3 235B | BF16 | Hybrid-EP | 698 | 1.05x |
| Qwen 3 235B | MXFP8 | A2A | 728 | 1x |
| Qwen 3 235B | MXFP8 | Hybrid-EP | 800 | 1.10x |
The results show:
- In the DeepSeek-V3 scenario (256 experts, top-8), Hybrid-EP achieves about a 14% performance improvement over DeepEP, without MTP.
- With Megatron-FSDP, Hybrid-EP still delivers about an 8% performance improvement.
- In the Qwen 3 235B scenario, there is a 5.5% improvement with BF16 and about a 9.9% improvement with MXFP8.
Learn more about how NVIDIA is enabling 10x performance and 1/10 cost for deploying MoE models.
