This post introduces Dynamic Context Parallelism (Dynamic-CP), a scheduling approach in NVIDIA Megatron Core used for LLM post-training or DiT pre-training. It dynamically selects the CP size per micro-batch to efficiently handle variable-length sequences, achieving up to a 1.48x speedup on real-world datasets.
In large-scale model training, an often-overlooked bottleneck arises from the sequence-length variability in real-world datasets. Both LLM training and large-scale video generation exhibit clear long-tail distributions in sequence length. A small fraction of ultra-long samples accounts for a disproportionately large share of the computational workload and memory consumption.
In LLM training, this results in wide-ranging text sequence lengths across batches. In video generation, high-resolution, multi-second videos can span tens of thousands of tokens. This results in imbalanced sample-level FLOPs and memory usage across data-parallel ranks, modalities, and micro-batches, hindering efficient scheduling and resource utilization.
To handle variable-length inputs, training systems commonly use sample-level packing, which combines multiple shorter sequences into a single micro-batch whose total token length is bounded by a target sequence length. In Figure 1, the sequences are packed to an equal length.
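As a rough illustration of this packing step, the following is a minimal sketch of a greedy first-fit-decreasing packer (illustrative only, not Megatron Core's data pipeline):

```python
def pack_sequences(seq_lens, target_len):
    """Greedy first-fit-decreasing packing of variable-length sequences.

    Returns a list of packs; each pack is a list of sequence indices whose
    total length stays within target_len.
    """
    packs = []  # each entry: [remaining_capacity, [sequence indices]]
    # Place longer sequences first so shorter ones fill the leftover space.
    for idx, length in sorted(enumerate(seq_lens), key=lambda x: -x[1]):
        target = next((p for p in packs if p[0] >= length), None)
        if target is None:
            target = [target_len, []]
            packs.append(target)
        target[0] -= length
        target[1].append(idx)
    return [indices for _, indices in packs]


# Example: pack five sequences to a target length of 8,192 tokens.
print(pack_sequences([7000, 3000, 2500, 1000, 500], target_len=8192))
# -> [[0, 3], [1, 2, 4]] (packs of 8,000 and 6,000 tokens)
```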


Although the three packed samples are the same length, their compute workloads are not equal, as shown in Figure 2, because of the quadratic nature of dot-product attention. This variation in compute workload across packed samples is known as data-parallel (DP) computational imbalance. This imbalance causes GPU idling, as some DP ranks wait for others with higher compute workloads to perform gradient synchronization. It also exacerbates pipeline-parallel (PP) bubbles.
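As a rough illustration of why equal token counts do not imply equal compute (the lengths below are made up, not the ones in Figure 2), the attention cost of a packed sample scales with the sum of squared sub-sequence lengths:

```python
def attention_cost(sub_seq_lens):
    # Relative attention cost of one packed sample: sum of squared sub-sequence lengths.
    return sum(s * s for s in sub_seq_lens)

# Three packs, each 8,192 tokens in total (illustrative lengths).
packs = [[8192], [4096, 4096], [1024] * 8]
baseline = attention_cost(packs[0])
print([attention_cost(p) / baseline for p in packs])
# -> [1.0, 0.5, 0.125]: same token count, up to an 8x difference in attention compute.
```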


Figure 3 shows an NVIDIA Nsight Systems profile that captures this imbalance in VLM training. Different image/video samples have variable sequence lengths, and packing is employed. The capture shows synchronization overhead across different DP groups.


Also, when using context parallelism, the CP sharding size is set by the longest sequence in the batch to avoid out-of-memory errors across GPUs. As a result, shorter sequences that do not require context parallelism are sharded anyway. Although these sequences fit on a single GPU, they are partitioned because of a longer sequence in the same batch, leading to unnecessary CP communication overhead.
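To make this concrete, a toy sizing rule might look like the following sketch; `max_tokens_per_gpu` stands in for a real memory model, and only power-of-two CP sizes are considered:

```python
def required_cp_size(seq_lens, max_tokens_per_gpu, allowed=(1, 2, 4, 8)):
    """Smallest power-of-two CP size that fits the longest sequence on one GPU.

    With static CP, this is decided once by the longest sequence in the whole
    batch; with Dynamic-CP, it can be evaluated per packed micro-batch.
    """
    longest = max(seq_lens)
    for cp in allowed:
        if longest / cp <= max_tokens_per_gpu:
            return cp
    return allowed[-1]

# Static CP: one 64K-token sequence forces CP=8 for every micro-batch.
print(required_cp_size([65536, 2048, 4096], max_tokens_per_gpu=8192))  # -> 8
# Dynamic-CP: a micro-batch with only short sequences can run with CP=1.
print(required_cp_size([2048, 4096], max_tokens_per_gpu=8192))         # -> 1
```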
Normally, computation hides CP communication. However, when CP sizes are large, especially when communication spans InfiniBand (IB) domains, communication overhead can become exposed when packed sequences are shorter and the compute workload is smaller. This is CP computational inefficiency.
The following shows an example where TP2CP8 is required because of the large total sequence length. Many packed sequences are made of smaller sub-sequences and do not have enough compute to hide the CP communication.


These observations show the need for a dynamic approach to context parallelism. Instead of statically fixing the CP size to the longest sequence in a micro-batch, this approach adapts the CP size based on the packing strategy per micro-batch. Related work, such as ByteScale and WLB-LLM, addresses similar problems.
Switching the CP size requires re-partitioning the sequence slices and re-forming the CP communication groups used by attention operations. Compared with alternative dynamic-parallelism schemes, such as adapting tensor-parallel or pipeline-parallel sizes based on sequence length, Dynamic-CP adds minimal overhead, because resizing TP/PP requires weight redistribution or pipeline graph restructuring, which are expensive.
Given a set of variable-length sequences, the solver determines how to pack them and selects the CP size to maximize computational efficiency without exceeding GPU memory limits. By modeling compute and communication costs, the solver avoids over-sharding short sequences and unnecessary CP communication, mitigating data-parallel imbalance and CP inefficiency.
The following example shows the benefit of using Dynamic-CP. Before applying workload balancing, the imbalance results in pipeline bubbles across different micro-batches, which further causes imbalance across DP ranks. After balancing, the bubbles across micro-batches and DP ranks are reduced.


Megatron Core framework modifications for supporting Dynamic-CP
This section introduces the pipeline for integrating Dynamic-CP into Megatron Core.


Constructing multiple context parallel groups per rank
With standard context parallelism, each rank belongs to a single group (cp_group) with a fixed cp_size that is statically determined during initialization. However, dynamic context parallelism uses a different cp_size across iterations and micro-batches.
To support this, a single rank must participate in multiple CP groups of different sizes. Multiple CP groups are constructed during initialization, with cp_size ranging from 1 up to dp × cp, restricted to powers of two. This design enables selecting the appropriate CP group at runtime based on the packing and scheduling result, without the overhead of dynamically creating communication groups.
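A minimal sketch of this initialization using torch.distributed, assuming a flat rank layout where consecutive ranks form CP groups; the actual Megatron Core rank mapping also accounts for TP/PP/DP ordering:

```python
import torch.distributed as dist

def build_cp_groups(world_size: int, max_cp_size: int):
    """Pre-build CP process groups for every power-of-two size up to max_cp_size."""
    cp_groups = {}  # cp_size -> the group this rank belongs to
    rank = dist.get_rank()
    cp_size = 1
    while cp_size <= max_cp_size:
        for start in range(0, world_size, cp_size):
            ranks = list(range(start, start + cp_size))
            # new_group is collective: every rank must call it with the same arguments.
            group = dist.new_group(ranks=ranks)
            if rank in ranks:
                cp_groups[cp_size] = group
        cp_size *= 2
    return cp_groups

# At runtime, the scheduler picks the pre-built group matching the chosen cp_size:
# cp_group = cp_groups[packed_seq_params.cp_size]
```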
Dynamic rescheduling and packing data
Unlike pretraining, which generally uses the Batch × Sequence × Head × Dim (BSHD) layout, Dynamic-CP operates on a THD layout. In this format, variable-length sequences are packed together under a length constraint, collapsing the original B and S dimensions into a token T dimension.
As a consequence, the number of micro-batches is no longer static. In the BSHD layout, the number of micro-batches is given by num_micro_batches = global_batch_size / dp_size / micro_batch_size.
With THD packing, the number of original sequences contained in each packed sequence is not fixed, causing num_micro_batches to vary across iterations.
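For example, with illustrative values:

```python
# BSHD: the micro-batch count is static and known up front.
global_batch_size, dp_size, micro_batch_size = 2048, 8, 1
num_micro_batches = global_batch_size // dp_size // micro_batch_size  # 256

# THD: each packed sequence holds a variable number of original samples, so the
# count is only known after packing the current global batch, for example:
# num_micro_batches = len(packed_micro_batches_for_this_dp_rank)
```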
Megatron Core provides multiple training schedulers depending on whether pipeline parallelism (PP/VPP) is enabled. To minimize invasive changes to the existing scheduling logic, a lightweight data_iterator_wrapper around the original data_iterator is introduced. It performs three steps:
- Rescheduling and packing sequences in the global batch to create a balanced workload across DP ranks.
- Choosing an appropriate cp_size based on the packing result to minimize CP communication inefficiency.
- Returning the effective num_micro_batches for the current iteration.
With this approach, Dynamic-CP support is added to all schedulers by inserting a single wrapper, keeping the original scheduling code largely intact. A minimal sketch of such a wrapper is shown below.
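In this sketch, the solver methods (reschedule_and_pack, select_cp_size) are hypothetical placeholders for the scheduler components described later:

```python
class DynamicCPDataIteratorWrapper:
    """Wraps the original data_iterator to perform the three steps above (sketch)."""

    def __init__(self, data_iterator, solver):
        self.data_iterator = data_iterator
        self.solver = solver  # hypothetical scheduler object

    def next_global_batch(self):
        global_batch = next(self.data_iterator)
        # 1. Reschedule and pack sequences to balance workload across DP ranks.
        packed_micro_batches = self.solver.reschedule_and_pack(global_batch)
        # 2. Choose a cp_size per micro-batch based on the packing result.
        for mb in packed_micro_batches:
            mb.cp_size = self.solver.select_cp_size(mb)
        # 3. Return the effective micro-batch count for this iteration.
        return packed_micro_batches, len(packed_micro_batches)
```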
Broadcasting across pipeline stages and extending PackedSeqParams
Since num_micro_batches varies and only TP rank 0 and the first and last PP stages handle scheduling in Megatron Core, the framework broadcasts num_micro_batches, max_seqlen, and cu_seqlens to all relevant PP ranks. This ensures consistent execution across pipeline stages under dynamic micro-batch scheduling.
With Dynamic-CP, the effective cp_size can vary between iterations, making it unsafe to rely on globally static CP settings. To handle this, PackedSeqParams is extended to hold both cp_size and cp_group.
All components that depend on context parallelism, such as position embedding and Transformer Engine attention, now retrieve the CP configuration from PackedSeqParams, replacing the original global CP variables. This guarantees that all CP-related operations remain consistent with the dynamically chosen CP layout.
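A sketch of the extended structure follows; the field names mirror common THD attention metadata and may not match Megatron Core exactly, and the last two fields are the Dynamic-CP additions:

```python
from dataclasses import dataclass
from typing import Optional

import torch
import torch.distributed as dist


@dataclass
class PackedSeqParams:
    # THD metadata already consumed by attention kernels.
    cu_seqlens_q: Optional[torch.Tensor] = None
    cu_seqlens_kv: Optional[torch.Tensor] = None
    max_seqlen_q: Optional[int] = None
    max_seqlen_kv: Optional[int] = None
    qkv_format: str = "thd"
    # Dynamic-CP additions: the per-micro-batch CP configuration.
    cp_size: int = 1
    cp_group: Optional[dist.ProcessGroup] = None
```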
Loss computation and FLOPs calculation
Given variable-length sequences and the THD layout, different sequences contribute different numbers of valid tokens. As a result, the loss is computed on a per-token basis: loss = loss_over_valid_tokens / total_number_valid_tokens. This avoids bias introduced by padding tokens.
Previous versions of Megatron Core did not account for the THD layout and assumed max_seqlen was the effective sequence length when computing FLOPs, resulting in systematic overestimation in variable-length scenarios.
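A minimal sketch of both computations under the THD layout (illustrative, not Megatron Core's exact code); loss_mask marks valid tokens, and cu_seqlens, num_heads, and head_dim are assumed inputs:

```python
import torch


def per_token_loss(token_losses: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    # Average only over valid tokens so padding does not bias the loss.
    return (token_losses * loss_mask).sum() / loss_mask.sum().clamp(min=1)


def attention_flops_thd(cu_seqlens: torch.Tensor, num_heads: int, head_dim: int) -> float:
    # Use the real sub-sequence lengths from cu_seqlens instead of max_seqlen.
    seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).float()
    # Roughly 4 * S^2 * num_heads * head_dim per sub-sequence (QK^T plus attention x V, forward only).
    return float((4.0 * seqlens.pow(2) * num_heads * head_dim).sum())
```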
Data scheduler modeling
Transformer workload scales quadratically with sequence length S, while activation memory grows only linearly, meaning even small variances can result in major imbalances in compute and memory across DP ranks and micro-batches. To balance a large sample's workload, we may pack small samples together, but this causes severe memory pressure. It is impossible to equalize FLOPs and memory simultaneously, which drives the scheduling and packing strategies.
The goal is to approximate an ideal, balanced distribution in which workload and memory are evenly split across DP ranks and micro-batches. With a fixed number of micro-batches per DP rank, a target workload and memory quota are set for each micro-batch. A 3-stage scheduler then alternates between workload and memory objectives, increasing the CP size for heavier samples as needed, to progressively approach both compute and memory balance.
Collaboration of cost model, solver, and simulator
A complete scheduler workflow consists of three components:
- The cost model estimates execution time for each sample based on its sequence length, modeling the per-sample workload across transformer operations. This defines the basic load unit, and its accuracy impacts the final performance gains (see the sketch after this list).
- The solver takes the cost model output as input and applies a heuristic algorithm to determine a near-optimal packing strategy for each sample. The packed samples are then grouped into micro-batches and assigned a context-parallel (CP) size. The number of micro-batches per DP rank affects the results: pipeline bubbles, pipeline-parallel imbalance bubbles, and data-parallel imbalance bubbles. Iterating over different micro-batch counts per DP rank yields the best final result.
- The simulator evaluates these micro-batches under the distributed pipeline-parallel schedule. It selects the plan with the minimum execution time (i.e., the most balanced workload) that also satisfies peak-memory constraints.
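A toy version of such a cost model is sketched below; the coefficients and model dimensions are placeholder assumptions that would normally be fit from profiled kernel timings:

```python
def estimate_sample_time(seq_len: int,
                         hidden: int = 5120,
                         num_layers: int = 40,
                         attn_coeff: float = 1e-12,
                         linear_coeff: float = 1e-12) -> float:
    """Toy per-sample cost model: execution time as a function of sequence length."""
    # Attention cost grows quadratically with sequence length.
    attn_time = attn_coeff * num_layers * seq_len * seq_len * hidden
    # Linear/MLP layers grow linearly with sequence length.
    linear_time = linear_coeff * num_layers * seq_len * hidden * hidden
    return attn_time + linear_time
```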
Modeling process and bi-objective balance
The ideal balanced distribution evenly splits workload and memory across different DP ranks and different micro-batches. Given the same number of micro-batches across DP ranks, a target workload and memory quota is set for each micro-batch. The pipeline bubble also differs and needs to be distributed evenly across micro-batches for end-to-end balance.
Equalizing the end-to-end training time across DP ranks suggests:

$T_1 = T_2 = \dots = T_{dp}$ (1)

Since every rank runs the same number of micro-batches, the workload quotas across ranks satisfy:

$Q_1 = Q_2 = \dots = Q_{dp} = Q$ (2)

Meanwhile, the total workload of a global batch of samples can be represented as:

$W_{\text{global}} = \sum_{k} w(s_k)$ (3)

where $w(s_k)$ is the cost-model estimate for sample $k$. Combining (2) and (3), the workload quota of each micro-batch on each DP rank can be determined as $Q = W_{\text{global}} / (dp \times \text{num\_micro\_batches})$.
Since the computational workload scales as $O(S^2)$ with respect to sequence length $S$, while memory consumption scales as $O(S)$, it is difficult to achieve both workload and memory balance simultaneously. Instead, the solver alternates between workload-oriented and memory-oriented objectives across stages, progressively approaching a balanced solution.
Samples whose workload exceeds the micro-batch workload quota are assigned a larger CP size. After this step, workload imbalance is reduced, and memory becomes the dominant constraint. The objective then shifts to memory, choosing the least compute-heavy sample to fill each bucket. The remaining samples are sorted in descending order and assigned to each micro-batch using the same heuristic, as sketched below.
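A simplified sketch of this heuristic; the memory-oriented stage is omitted for brevity, `quota` comes from the modeling above, and samples carry cost-model estimates:

```python
def assign_cp_and_pack(samples, quota, base_cp=1, max_cp=8):
    """Sketch of the workload-oriented stages: size CP first, then pack greedily.

    samples: list of (sample_id, estimated_time) pairs from the cost model.
    quota: per-micro-batch workload target.
    """
    # Stage 1: samples exceeding the quota get a larger (power-of-two) CP size
    # so that their per-rank workload fits within the quota.
    cp_sizes = {}
    for sid, t in samples:
        cp = base_cp
        while t / cp > quota and cp < max_cp:
            cp *= 2
        cp_sizes[sid] = cp

    # Stage 2: greedily fill micro-batch buckets, heaviest remaining sample first.
    buckets = []  # each bucket: [accumulated_time, [sample_ids]]
    for sid, t in sorted(samples, key=lambda x: x[1], reverse=True):
        per_rank_t = t / cp_sizes[sid]
        bucket = next((b for b in buckets if b[0] + per_rank_t <= quota), None)
        if bucket is None:
            bucket = [0.0, []]
            buckets.append(bucket)
        bucket[0] += per_rank_t
        bucket[1].append(sid)
    return cp_sizes, buckets
```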
Zero-overhead execution
At runtime, the scheduling workflow must not introduce noticeable overhead into the training loop. In practice, the system needs to overcome two sources of overhead: I/O pressure and solver runtime.
I/O pressure
First, constructing a scheduling plan requires an additional get_item pass over the global batch to gather sequence-length and shape information. Two complementary techniques alleviate this I/O pressure: distributing the probing get_item calls across the cluster, and gathering only lightweight shape and sequence-length metadata through an extra communication step.
Solver runtime
To avoid blocking the main training process, the solver runs asynchronously within the data_sampler so that it overlaps with training iterations. To keep the search space manageable, exhaustive search is replaced with a small grid search. All DP ranks are constrained to use the same number of micro-batches, and this count is swept from PP × 1 up to a small multiple of PP.
Under a fixed global batch size, this one-dimensional grid captures the trade-off between per-micro-batch workload and pipeline bubbles. Figure 7 shows that workload variance quickly shrinks as the micro-batch count grows. The "knee" point of this curve is chosen, and the search is restricted to its neighborhood to keep solver overhead practical.
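A sketch of this grid search; `pack_fn` and `simulate` stand in for the solver and simulator described earlier:

```python
def grid_search_micro_batch_count(samples, pp_size, pack_fn, simulate,
                                  memory_budget, max_multiple=4):
    """Sweep the per-DP-rank micro-batch count and keep the fastest feasible plan.

    pack_fn(samples, count) -> plan and simulate(plan) -> (est_time, peak_memory)
    are placeholders for the components described above.
    """
    best_plan, best_time = None, float("inf")
    # All DP ranks use the same count, swept from PP up to a small multiple of PP.
    for count in range(pp_size, pp_size * max_multiple + 1):
        plan = pack_fn(samples, count)
        est_time, peak_memory = simulate(plan)
        # Keep the most balanced plan that still fits within peak memory.
        if peak_memory <= memory_budget and est_time < best_time:
            best_plan, best_time = plan, est_time
    return best_plan
```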


Benchmark results
With all the enhancements introduced, the imbalance bubbles caused by variable-length sequence distributions can be substantially reduced.
In Table 1, Dynamic CP is evaluated against a pure packing baseline under the following setup: Llama 13B, global batch size 2048, PP=8, CP=8, and full recompute. Ten iterations are run, with the first discarded as a warm-up, and the iteration time is averaged over the remaining nine. Dynamic CP achieves 1.48x and 1.25x speedups on the GitHub and CommonCrawl datasets, respectively.
In a multi-thousand-GPU industrial environment, the Dynamic CP method yields over 35% end-to-end performance improvement.
| Model Size | Dataset type | Method | TFLOPS/GPU |
| --- | --- | --- | --- |
| Llama 13B | GitHub | Only Packing | 195.88 |
| Llama 13B | GitHub | Dynamic CP | 289.32 |
| Llama 13B | CommonCrawl | Only Packing | 139.17 |
| Llama 13B | CommonCrawl | Dynamic CP | 174.39 |
Table 1. Comparison of Dynamic CP and pure packing methods across different datasets
Learn more
This post showed that Dynamic CP with the Megatron Core backend improves training throughput for variable-length sequences compared with the fixed-CP method. With sequence packing, 4D parallelism, and GPU-optimized kernels, Dynamic CP ensures high training efficiency across model scales.
Start with:
- Megatron Core GitHub to start training your model with variable-length sequences using Megatron Core optimizations.
- The scheduler, which is also available on GitHub.
Thanks to every colleague for their contributions to this project.
