This post introduces Dynamic Context Parallelism (Dynamic-CP), a scheduling approach in NVIDIA Megatron Core used for LLM post-training or DiT pre-training. It dynamically selects the CP size per micro-batch to efficiently handle variable-length sequences, achieving up to a 1.48x speedup on real-world datasets.
In large-scale model training, an often-overlooked bottleneck arises from the sequence-length variability in real-world datasets. Both LLM training and large-scale video generation exhibit clear long-tail distributions in sequence length. A small fraction of ultra-long samples accounts for a disproportionately large share of the computational workload and memory consumption.
In LLM training, this results in wide-ranging text sequence lengths across batches. In video generation, high-resolution, multi-second videos can span tens of thousands of tokens. This results in imbalanced sample-level FLOPs and memory usage across data-parallel ranks, modalities, and micro-batches, hindering efficient scheduling and resource utilization.
To handle variable-length inputs, training systems commonly use sample-level packing, which combines multiple shorter sequences into a single micro-batch whose total token length is bounded by a target sequence length. In Figure 1, the sequences are packed to an equal length.
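As a rough illustration of this packing step, the following is a minimal sketch of a greedy first-fit-decreasing packer (illustrative only, not Megatron Core's data pipeline):

```python
def pack_sequences(seq_lens, target_len):
    """Greedy first-fit-decreasing packing of variable-length sequences.

    Returns a list of packs; each pack is a list of sequence indices whose
    total length stays within target_len.
    """
    packs = []  # each entry: [remaining_capacity, [sequence indices]]
    # Place longer sequences first so shorter ones fill the leftover space.
    for idx, length in sorted(enumerate(seq_lens), key=lambda x: -x[1]):
        target = next((p for p in packs if p[0] >= length), None)
        if target is None:
            target = [target_len, []]
            packs.append(target)
        target[0] -= length
        target[1].append(idx)
    return [indices for _, indices in packs]


# Example: pack five sequences to a target length of 8,192 tokens.
print(pack_sequences([7000, 3000, 2500, 1000, 500], target_len=8192))
# -> [[0, 3], [1, 2, 4]] (packs of 8,000 and 6,000 tokens)
```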


Although the three packed samples are the same length, their compute workloads are not equal, as shown in Figure 2, because of the quadratic nature of dot-product attention. This variation in compute workload across packed samples is known as data-parallel (DP) computational imbalance. This imbalance causes GPU idling, as some DP ranks wait for others with higher compute workloads to perform gradient synchronization. It also exacerbates pipeline-parallel (PP) bubbles.
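As a rough illustration of why equal token counts do not imply equal compute (the lengths below are made up, not the ones in Figure 2), the attention cost of a packed sample scales with the sum of squared sub-sequence lengths:

```python
def attention_cost(sub_seq_lens):
    # Relative attention cost of one packed sample: sum of squared sub-sequence lengths.
    return sum(s * s for s in sub_seq_lens)

# Three packs, each 8,192 tokens in total (illustrative lengths).
packs = [[8192], [4096, 4096], [1024] * 8]
baseline = attention_cost(packs[0])
print([attention_cost(p) / baseline for p in packs])
# -> [1.0, 0.5, 0.125]: same token count, up to an 8x difference in attention compute.
```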


Figure 3 shows an NVIDIA Nsight Systems profile that captures this imbalance in VLM training. Different image/video samples have variable sequence lengths, and packing is employed. The capture shows synchronization overhead across different DP groups.


Also, when using context parallelism, the CP sharding size is set by the longest sequence in the batch to avoid out-of-memory errors across GPUs. As a result, shorter sequences that do not require context parallelism are sharded anyway. Although these sequences fit on a single GPU, they are partitioned because of a longer sequence in the same batch, leading to unnecessary CP communication overhead.
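To make this concrete, a toy sizing rule might look like the following sketch; `max_tokens_per_gpu` stands in for a real memory model, and only power-of-two CP sizes are considered:

```python
def required_cp_size(seq_lens, max_tokens_per_gpu, allowed=(1, 2, 4, 8)):
    """Smallest power-of-two CP size that fits the longest sequence on one GPU.

    With static CP, this is decided once by the longest sequence in the whole
    batch; with Dynamic-CP, it can be evaluated per packed micro-batch.
    """
    longest = max(seq_lens)
    for cp in allowed:
        if longest / cp <= max_tokens_per_gpu:
            return cp
    return allowed[-1]

# Static CP: one 64K-token sequence forces CP=8 for every micro-batch.
print(required_cp_size([65536, 2048, 4096], max_tokens_per_gpu=8192))  # -> 8
# Dynamic-CP: a micro-batch with only short sequences can run with CP=1.
print(required_cp_size([2048, 4096], max_tokens_per_gpu=8192))         # -> 1
```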
Normally, computation hides CP communication. However, when CP sizes are large, especially when communication spans InfiniBand (IB) domains, communication overhead can become exposed when packed sequences are shorter and the compute workload is smaller. This is CP computational inefficiency.
The following shows an example where TP2CP8 is required because of the large total sequence length. Many packed sequences are made of smaller sub-sequences and do not have enough compute to hide the CP communication.


These observations show the need for a dynamic approach to context parallelism. Instead of statically fixing the CP size to the longest sequence in a micro-batch, this approach adapts the CP size based on the packing strategy per micro-batch. Related work, such as ByteScale and WLB-LLM, addresses similar problems.
Switching the CP size requires re-partitioning the sequence slices and re-forming the CP communication groups used by attention operations. Compared with alternative dynamic-parallelism schemes, such as adapting tensor-parallel or pipeline-parallel sizes based on sequence length, Dynamic-CP adds minimal overhead, because resizing TP/PP requires weight redistribution or pipeline graph restructuring, which are expensive.
Given a set of variable-length sequences, the solver determines how to pack them and selects the CP size to maximize computational efficiency without exceeding GPU memory limits. By modeling compute and communication costs, the solver avoids over-sharding short sequences and unnecessary CP communication, mitigating data-parallel imbalance and CP inefficiency.
The following example shows the benefit of using Dynamic-CP. Before applying workload balancing, the imbalance results in pipeline bubbles across different micro-batches, which further causes imbalance across DP ranks. After balancing, the bubbles across micro-batches and DP ranks are reduced.


Megatron Core framework modifications for supporting Dynamic-CP
This section introduces the pipeline for integrating Dynamic-CP into Megatron Core.


Constructing multiple context parallel groups per rank
With standard context parallelism, each rank belongs to a single group (cp_group) with a fixed cp_size that is statically determined during initialization. However, dynamic context parallelism uses a different cp_size across iterations and micro-batches.
To support this, a single rank must participate in multiple CP groups of different sizes. Multiple CP groups are constructed during initialization, with cp_size ranging from 1 up to dp × cp, restricted to powers of two. This design enables selecting the appropriate CP group at runtime based on the packing and scheduling result, without the overhead of dynamically creating communication groups.
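A minimal sketch of this initialization using torch.distributed, assuming a flat rank layout where consecutive ranks form CP groups; the actual Megatron Core rank mapping also accounts for TP/PP/DP ordering:

```python
import torch.distributed as dist

def build_cp_groups(world_size: int, max_cp_size: int):
    """Pre-build CP process groups for every power-of-two size up to max_cp_size."""
    cp_groups = {}  # cp_size -> the group this rank belongs to
    rank = dist.get_rank()
    cp_size = 1
    while cp_size <= max_cp_size:
        for start in range(0, world_size, cp_size):
            ranks = list(range(start, start + cp_size))
            # new_group is collective: every rank must call it with the same arguments.
            group = dist.new_group(ranks=ranks)
            if rank in ranks:
                cp_groups[cp_size] = group
        cp_size *= 2
    return cp_groups

# At runtime, the scheduler picks the pre-built group matching the chosen cp_size:
# cp_group = cp_groups[packed_seq_params.cp_size]
```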
Dynamic rescheduling and packing data
Unlike pretraining, which generally uses the Batch × Sequence × Head × Dim (BSHD) layout, Dynamic-CP operates on a THD layout. In this format, variable-length sequences are packed together under a length constraint, collapsing the original B and S dimensions into a token T dimension.
As a consequence, the number of micro-batches is no longer static. In the BSHD layout, the number of micro-batches is given by num_micro_batches = global_batch_size / dp_size / micro_batch_size.
With THD packing, the number of original sequences contained in each packed sequence is not fixed, causing num_micro_batches to vary across iterations.
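For example, with illustrative values:

```python
# BSHD: the micro-batch count is static and known up front.
global_batch_size, dp_size, micro_batch_size = 2048, 8, 1
num_micro_batches = global_batch_size // dp_size // micro_batch_size  # 256

# THD: each packed sequence holds a variable number of original samples, so the
# count is only known after packing the current global batch, for example:
# num_micro_batches = len(packed_micro_batches_for_this_dp_rank)
```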
Megatron Core provides multiple training schedulers depending on whether pipeline parallelism (PP/VPP) is enabled. To minimize invasive changes to the existing scheduling logic, a lightweight data_iterator_wrapper around the original data_iterator is introduced. It performs three steps:
- Rescheduling and packing sequences in the global batch to create a balanced workload across DP ranks.
- Choosing an appropriate cp_size based on the packing result to minimize CP communication inefficiency.
- Returning the effective num_micro_batches for the current iteration.
With this approach, Dynamic-CP support is added to all schedulers by inserting a single wrapper, keeping the original scheduling code largely intact. A minimal sketch of such a wrapper is shown below.
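In this sketch, the solver methods (reschedule_and_pack, select_cp_size) are hypothetical placeholders for the scheduler components described later:

```python
class DynamicCPDataIteratorWrapper:
    """Wraps the original data_iterator to perform the three steps above (sketch)."""

    def __init__(self, data_iterator, solver):
        self.data_iterator = data_iterator
        self.solver = solver  # hypothetical scheduler object

    def next_global_batch(self):
        global_batch = next(self.data_iterator)
        # 1. Reschedule and pack sequences to balance workload across DP ranks.
        packed_micro_batches = self.solver.reschedule_and_pack(global_batch)
        # 2. Choose a cp_size per micro-batch based on the packing result.
        for mb in packed_micro_batches:
            mb.cp_size = self.solver.select_cp_size(mb)
        # 3. Return the effective micro-batch count for this iteration.
        return packed_micro_batches, len(packed_micro_batches)
```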
Broadcasting across pipeline stages and extending PackedSeqParams
Since num_micro_batches varies and only TP rank 0 and the first and last PP stages handle scheduling in Megatron Core, the framework broadcasts num_micro_batches, max_seqlen, and cu_seqlens to all relevant PP ranks. This ensures consistent execution across pipeline stages under dynamic micro-batch scheduling.
With Dynamic-CP, the effective cp_size can vary between iterations, making it unsafe to rely on globally static CP settings. To handle this, PackedSeqParams is extended to hold both cp_size and cp_group.
All components that depend on context parallelism, such as position embedding and Transformer Engine attention, now retrieve the CP configuration from PackedSeqParams, replacing the original global CP variables. This guarantees that all CP-related operations remain consistent with the dynamically chosen CP layout.
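A sketch of the extended structure follows; the field names mirror common THD attention metadata and may not match Megatron Core exactly, and the last two fields are the Dynamic-CP additions:

```python
from dataclasses import dataclass
from typing import Optional

import torch
import torch.distributed as dist


@dataclass
class PackedSeqParams:
    # THD metadata already consumed by attention kernels.
    cu_seqlens_q: Optional[torch.Tensor] = None
    cu_seqlens_kv: Optional[torch.Tensor] = None
    max_seqlen_q: Optional[int] = None
    max_seqlen_kv: Optional[int] = None
    qkv_format: str = "thd"
    # Dynamic-CP additions: the per-micro-batch CP configuration.
    cp_size: int = 1
    cp_group: Optional[dist.ProcessGroup] = None
```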
Loss computation and FLOPs calculation
Given variable-length sequences and the THD layout, different sequences contribute different numbers of valid tokens. As a result, the loss is computed on a per-token basis: loss = loss_over_valid_tokens / total_number_valid_tokens. This avoids bias introduced by padding tokens.
Previous versions of Megatron Core did not account for the THD layout and assumed max_seqlen was the effective sequence length when computing FLOPs, resulting in systematic overestimation in variable-length scenarios.
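A minimal sketch of both computations under the THD layout (illustrative, not Megatron Core's exact code); loss_mask marks valid tokens, and cu_seqlens, num_heads, and head_dim are assumed inputs:

```python
import torch


def per_token_loss(token_losses: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    # Average only over valid tokens so padding does not bias the loss.
    return (token_losses * loss_mask).sum() / loss_mask.sum().clamp(min=1)


def attention_flops_thd(cu_seqlens: torch.Tensor, num_heads: int, head_dim: int) -> float:
    # Use the real sub-sequence lengths from cu_seqlens instead of max_seqlen.
    seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).float()
    # Roughly 4 * S^2 * num_heads * head_dim per sub-sequence (QK^T plus attention x V, forward only).
    return float((4.0 * seqlens.pow(2) * num_heads * head_dim).sum())
```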
Data scheduler modeling
Transformer workload scales quadratically with sequence length S, while activation memory grows only linearly, meaning even small variances can result in major imbalances in compute and memory across DP ranks and micro-batches. To balance a large sample's workload, we may pack small samples together, but this causes severe memory pressure. It is impossible to equalize FLOPs and memory simultaneously, which drives the scheduling and packing strategies.
The goal is to approximate an ideal, balanced distribution in which workload and memory are evenly split across DP ranks and micro-batches. With a fixed number of micro-batches per DP rank, a target workload and memory quota are set for each micro-batch. A 3-stage scheduler then alternates between workload and memory objectives, increasing the CP size for heavier samples as needed, to progressively approach both compute and memory balance.
Collaboration of cost model, solver, and simulator
A complete scheduler workflow consists of three components:
- The cost model estimates execution time for each sample based on its sequence length, modeling the per-sample workload across transformer operations. This defines the basic load unit, and its accuracy impacts the final performance gains (see the sketch after this list).
- The solver takes the cost model output as input and applies a heuristic algorithm to determine a near-optimal packing strategy for each sample. The packed samples are then grouped into micro-batches and assigned a context-parallel (CP) size. The number of micro-batches per DP rank affects the results: pipeline bubbles, pipeline-parallel imbalance bubbles, and data-parallel imbalance bubbles. Iterating over different micro-batch counts per DP rank yields the best final result.
- The simulator evaluates these micro-batches under the distributed pipeline-parallel schedule. It selects the plan with the minimum execution time (i.e., the most balanced workload) that also satisfies peak-memory constraints.
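A toy version of such a cost model is sketched below; the coefficients and model dimensions are placeholder assumptions that would normally be fit from profiled kernel timings:

```python
def estimate_sample_time(seq_len: int,
                         hidden: int = 5120,
                         num_layers: int = 40,
                         attn_coeff: float = 1e-12,
                         linear_coeff: float = 1e-12) -> float:
    """Toy per-sample cost model: execution time as a function of sequence length."""
    # Attention cost grows quadratically with sequence length.
    attn_time = attn_coeff * num_layers * seq_len * seq_len * hidden
    # Linear/MLP layers grow linearly with sequence length.
    linear_time = linear_coeff * num_layers * seq_len * hidden * hidden
    return attn_time + linear_time
```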
Modeling process and bi-objective balance
The ideal balanced distribution evenly splits workload and memory across different DP ranks and different micro-batches. Given the same number of micro-batches across DP ranks, a target workload and memory quota is set for each micro-batch. The pipeline bubble also differs and needs to be distributed evenly across micro-batches for end-to-end balance.
Equalizing the end-to-end training time across DP ranks suggests:

$T_1 = T_2 = \dots = T_{dp}$ (1)

Since every rank runs the same number of micro-batches, the workload quotas across ranks satisfy:

$Q_1 = Q_2 = \dots = Q_{dp} = Q$ (2)

Meanwhile, the total workload of a global batch of samples can be represented as:

$W_{\text{global}} = \sum_{k} w(s_k)$ (3)

where $w(s_k)$ is the cost-model estimate for sample $k$. Combining (2) and (3), the workload quota of each micro-batch on each DP rank can be determined as $Q = W_{\text{global}} / (dp \times \text{num\_micro\_batches})$.
Since the computational workload scales as $O(S^2)$ with respect to sequence length $S$, while memory consumption scales as $O(S)$, it is difficult to achieve both workload and memory balance simultaneously. Instead, the solver alternates between workload-oriented and memory-oriented objectives across stages, progressively approaching a balanced solution.
Samples whose workload exceeds the micro-batch workload quota are assigned a larger CP size. After this step, workload imbalance is reduced, and memory becomes the dominant constraint. The objective then shifts to memory, choosing the least compute-heavy sample to fill each bucket. The remaining samples are sorted in descending order and assigned to each micro-batch using the same heuristic, as sketched below.
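A simplified sketch of this heuristic; the memory-oriented stage is omitted for brevity, `quota` comes from the modeling above, and samples carry cost-model estimates:

```python
def assign_cp_and_pack(samples, quota, base_cp=1, max_cp=8):
    """Sketch of the workload-oriented stages: size CP first, then pack greedily.

    samples: list of (sample_id, estimated_time) pairs from the cost model.
    quota: per-micro-batch workload target.
    """
    # Stage 1: samples exceeding the quota get a larger (power-of-two) CP size
    # so that their per-rank workload fits within the quota.
    cp_sizes = {}
    for sid, t in samples:
        cp = base_cp
        while t / cp > quota and cp < max_cp:
            cp *= 2
        cp_sizes[sid] = cp

    # Stage 2: greedily fill micro-batch buckets, heaviest remaining sample first.
    buckets = []  # each bucket: [accumulated_time, [sample_ids]]
    for sid, t in sorted(samples, key=lambda x: x[1], reverse=True):
        per_rank_t = t / cp_sizes[sid]
        bucket = next((b for b in buckets if b[0] + per_rank_t <= quota), None)
        if bucket is None:
            bucket = [0.0, []]
            buckets.append(bucket)
        bucket[0] += per_rank_t
        bucket[1].append(sid)
    return cp_sizes, buckets
```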
Zero-overhead execution
At runtime, the scheduling workflow must not introduce noticeable overhead into the training loop. In practice, the system needs to overcome two sources of overhead: I/O pressure and solver runtime.
I/O pressure
First, constructing a scheduling plan requires an additional get_item pass over the global batch to gather sequence-length and shape information. Two complementary techniques alleviate this I/O pressure: distributing the probing get_item calls across the cluster, and gathering only lightweight shape and sequence-length metadata through an extra communication step.
Solver runtime
To avoid blocking the main training process, the solver runs asynchronously within the data_sampler so that it overlaps with training iterations. To keep the search space manageable, exhaustive search is replaced with a small grid search. All DP ranks are constrained to use the same number of micro-batches, and this count is swept from PP × 1 up to a small multiple of PP.
Under a fixed global batch size, this one-dimensional grid captures the trade-off between per-micro-batch workload and pipeline bubbles. Figure 7 shows that workload variance quickly shrinks as the micro-batch count grows. The "knee" point of this curve is chosen, and the search is restricted to its neighborhood to keep solver overhead practical.
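A sketch of this grid search; `pack_fn` and `simulate` stand in for the solver and simulator described earlier:

```python
def grid_search_micro_batch_count(samples, pp_size, pack_fn, simulate,
                                  memory_budget, max_multiple=4):
    """Sweep the per-DP-rank micro-batch count and keep the fastest feasible plan.

    pack_fn(samples, count) -> plan and simulate(plan) -> (est_time, peak_memory)
    are placeholders for the components described above.
    """
    best_plan, best_time = None, float("inf")
    # All DP ranks use the same count, swept from PP up to a small multiple of PP.
    for count in range(pp_size, pp_size * max_multiple + 1):
        plan = pack_fn(samples, count)
        est_time, peak_memory = simulate(plan)
        # Keep the most balanced plan that still fits within peak memory.
        if peak_memory <= memory_budget and est_time < best_time:
            best_plan, best_time = plan, est_time
    return best_plan
```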


Benchmark results
With all the enhancements introduced, the imbalance bubbles caused by variable-length sequence distributions can be substantially reduced.
In Table 1, Dynamic CP is evaluated against a pure packing baseline under the following setup: Llama 13B, global batch size 2048, PP=8, CP=8, and full recompute. Ten iterations are run, with the first discarded as a warm-up, and the iteration time is averaged over the remaining nine. Dynamic CP achieves 1.48x and 1.25x speedups on the GitHub and CommonCrawl datasets, respectively.
In a multi-thousand-GPU industrial environment, the Dynamic CP method yields over 35% end-to-end performance improvement.
| Model Size | Dataset type | Method | TFLOPS/GPU |
| --- | --- | --- | --- |
| Llama 13B | GitHub | Only Packing | 195.88 |
| Llama 13B | GitHub | Dynamic CP | 289.32 |
| Llama 13B | CommonCrawl | Only Packing | 139.17 |
| Llama 13B | CommonCrawl | Dynamic CP | 174.39 |
Table 1. Comparison of Dynamic CP and pure packing methods across different datasets
Learn more
This post showed that Dynamic CP with the Megatron Core backend improves training throughput for variable-length sequences compared with the fixed-CP method. With sequence packing, 4D parallelism, and GPU-optimized kernels, Dynamic CP ensures high training efficiency across model scales.
Start with:
- Megatron Core GitHub to start training your model with variable-length sequences using Megatron Core optimizations.
- The scheduler, which is also available on GitHub.
Thanks to every colleague for their contributions to this project.
