Ulysses Sequence Parallelism: Training with Million-Token Contexts

By Kashif Rasul and Stas Bekman
Training large language models on long sequences has become essential for building capable AI systems. As models are increasingly used for tasks like document analysis, code understanding, complex reasoning, and RAG workloads, the need to process sequences of hundreds of thousands, or even millions, of tokens has grown dramatically. To put this in perspective, an average book is roughly 250k tokens, so training on multi-document contexts or book-length inputs requires handling sequences well beyond what fits on a single GPU. However, training with such long contexts presents significant memory challenges: the attention computation scales quadratically with sequence length, quickly exceeding GPU memory for contexts beyond tens of thousands of tokens.

Ulysses Sequence Parallelism (part of the Arctic Long Sequence Training (ALST) protocol from Snowflake AI Research) provides an elegant solution by distributing the attention computation across multiple GPUs through attention head parallelism. In this post, we'll explore how Ulysses works and how it has been integrated across the Hugging Face ecosystem, from Accelerate to the Transformers Trainer and TRL's SFTTrainer.






The Challenge of Long Sequence Training

The attention mechanism in transformers scales quadratically with sequence length. For a sequence of length n, standard attention requires O(n^2) memory and compute to materialize the attention scores.
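To get a feel for the scale, here is a back-of-envelope sketch (assuming bf16 scores and a hypothetical 32-head model; the helper name is ours):

```python
def attn_scores_gib(seq_len, num_heads=32, bytes_per_elem=2):
    """Memory (GiB) needed to materialize the full attention-score matrix."""
    return seq_len ** 2 * num_heads * bytes_per_elem / 2 ** 30

print(attn_scores_gib(8_192))    # 4.0 GiB: manageable
print(attn_scores_gib(131_072))  # 1024.0 GiB (~1 TiB): far beyond any single GPU
```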

Consider these scenarios where long-context training is crucial:

  • Document understanding: Processing entire books, legal documents, or research papers
  • Code analysis: Understanding large codebases with multiple interconnected files
  • Reasoning tasks: Models that “think” step-by-step may generate thousands of tokens during inference
  • Retrieval-augmented generation: Incorporating many retrieved passages into the context

Traditional data parallelism doesn’t help here: each GPU still has to process the full sequence inside the attention block. We need a way to split the sequence itself across multiple devices.



How Ulysses Works

Ulysses Sequence Parallelism (SP), introduced in the DeepSpeed Ulysses paper, takes a clever approach: in addition to splitting on the sequence dimension, it also partitions the attention heads across GPUs.

Ulysses Sequence Parallelism Overview
Ulysses splits input sequences along the sequence dimension and uses all-to-all communication to exchange key-value pairs, enabling each GPU to compute a subset of attention heads. (Source: Snowflake Engineering Blog)

Here’s how it works:

  1. Sequence Sharding: The input sequence is split along the sequence dimension across P GPUs. Each GPU i holds tokens [i·n/P, (i+1)·n/P).

  2. QKV Projection: Each GPU computes the query, key, and value projections for its local sequence chunk.

  3. All-to-All Communication: An all-to-all collective operation redistributes the data so that each GPU holds all sequence positions after the projections, but only for a subset of attention heads.

  4. Local Attention: Each GPU computes attention for its assigned heads using standard attention mechanisms (FlashAttention or SDPA).

  5. All-to-All Communication: Another all-to-all operation reverses the redistribution, returning to sequence-sharded format.

  6. Output Projection: Each GPU computes the output projection for its local sequence chunk.

The key insight is that attention heads are independent—each head can be computed separately. By trading sequence locality for head locality, Ulysses enables efficient parallelization with relatively low communication overhead.
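The two redistributions (steps 3 and 5) can be sketched numerically. Below is a single-process simulation with NumPy standing in for the collectives; rank count and tensor shapes are illustrative, not the actual DeepSpeed implementation:

```python
import numpy as np

P, n, h, d = 4, 16, 8, 5  # ranks, total seq len, attention heads, head dim
rng = np.random.default_rng(0)

# step 1: each rank holds a sequence shard of shape (n/P, h, d)
shards = [rng.standard_normal((n // P, h, d)) for _ in range(P)]

def all_to_all(shards):
    """Trade sequence locality for head locality: rank j ends up with
    all n positions but only heads [j*h/P, (j+1)*h/P)."""
    hp = h // P
    return [
        np.concatenate([s[:, j * hp:(j + 1) * hp, :] for s in shards], axis=0)
        for j in range(P)
    ]

per_head = all_to_all(shards)              # step 3: (n, h/P, d) per rank
assert per_head[0].shape == (n, h // P, d)

def all_to_all_back(per_head):
    """Step 5: the inverse all-to-all restores sequence sharding."""
    chunk = n // P
    return [
        np.concatenate([g[i * chunk:(i + 1) * chunk, :, :] for g in per_head], axis=1)
        for i in range(P)
    ]

restored = all_to_all_back(per_head)       # back to (n/P, h, d) per rank
assert all(np.array_equal(a, b) for a, b in zip(restored, shards))
```

The round trip is lossless, which is why the output projection in step 6 can proceed exactly as it would without SP.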



Communication Complexity

Ulysses requires two all-to-all operations per attention layer, with a total communication volume of O(n·d/P) per GPU, where:

  • n is the sequence length
  • d is the hidden dimension
  • P is the parallelism degree

Ring Attention, by contrast, communicates O(n·d) per GPU per layer, independent of the parallelism degree, so Ulysses's communication overhead shrinks as more GPUs are added.
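A rough per-GPU, per-layer comparison, counting elements moved and ignoring constant factors (the helper names and the 2x factor for the pair of all-to-alls are our simplification):

```python
def ulysses_comm_per_gpu(n, d, P):
    # two all-to-alls per layer, each moving roughly n*d/P elements per GPU
    return 2 * n * d // P

def ring_comm_per_gpu(n, d, P):
    # the full n*d KV stream passes through each GPU over the ring steps
    return n * d

n, d = 131_072, 4_096
for P in (2, 4, 8):
    ratio = ulysses_comm_per_gpu(n, d, P) / ring_comm_per_gpu(n, d, P)
    print(P, ratio)  # 1.0, 0.5, 0.25: Ulysses' share shrinks as P grows
```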



Integration with Accelerate

Accelerate provides the foundation for Ulysses sequence parallelism through its ParallelismConfig class and DeepSpeed integration.



Configuration

from accelerate import Accelerator
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,  # number of GPUs to split each sequence across
    dp_shard_size=1,  # no additional data parallelism in this example
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length=None,  # not needed when sequence length is variable
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",  # or "flash_attention_3" / "sdpa"
    ),
)

accelerator = Accelerator(parallelism_config=parallelism_config)



Key Parameters

| Parameter | Description |
|---|---|
| sp_size | Number of GPUs for sequence parallelism |
| sp_backend | Must be "deepspeed" for Ulysses |
| sp_seq_length_is_variable | Set to True for varying sequence lengths across batches |
| sp_attn_implementation | "flash_attention_2", "flash_attention_3", or "sdpa" |



Using the Accelerator

When you call accelerator.prepare(), Ulysses is automatically set up:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# dataloader is your usual torch DataLoader; prepare() wraps it for SP
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

The prepare() call:

  1. Registers the model with DeepSpeed’s UlyssesSPAttentionHF
  2. Wraps the dataloader with UlyssesSPDataLoaderAdapter to handle sequence sharding
  3. Automatically injects shift_labels for correct loss computation
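Conceptually, the dataloader adapter does something like the following. This is a simplified single-sample sketch, not the actual UlyssesSPDataLoaderAdapter code; the key point is that labels are shifted before sharding, so chunk boundaries stay correct:

```python
import numpy as np

sp_size = 4
input_ids = np.arange(16)                 # one sample, seq_len = 16
labels = input_ids.copy()

# shift BEFORE sharding: token i is supervised by token i+1, even across chunk edges
shift_labels = np.concatenate([labels[1:], [-100]])

id_chunks = np.split(input_ids, sp_size)  # rank r gets seq_len/sp_size tokens
label_chunks = np.split(shift_labels, sp_size)

# rank 0 holds tokens 0..3 and must predict tokens 1..4; the last target (4)
# lives on rank 1's input chunk, which is why pre-shifting is required
assert label_chunks[0][-1] == id_chunks[1][0]
```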



Loss Aggregation

With Ulysses, each GPU computes loss on different parts of the sequence. The losses must be aggregated properly, weighted by the number of valid tokens per rank. If you’re using the Transformers Trainer or TRL’s SFTTrainer, this is handled automatically—the code below is only needed when writing a custom Accelerate training loop:

sp_size = parallelism_config.sp_size
if sp_size > 1:
    from deepspeed.utils import groups

    sp_group = groups._get_sequence_parallel_group()

    # gather each rank's mean loss and its count of valid (non -100) tokens
    losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
    good_tokens = (batch["shift_labels"] != -100).view(-1).sum()
    good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)

    # token-weighted average: ranks with more valid tokens contribute more
    total_loss = sum(
        losses_per_rank[i] * good_tokens_per_rank[i]
        for i in range(sp_size)
        if good_tokens_per_rank[i] > 0
    )
    loss = total_loss / max(sum(good_tokens_per_rank), 1)

accelerator.backward(loss)

The weighted loss aggregation ensures correct gradients when tokens are unevenly distributed across ranks (e.g., when some ranks contain only padding or masked out prompt tokens).
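A toy example shows why the weighting matters when ranks hold different numbers of valid tokens (values invented for illustration):

```python
# per-rank mean losses and valid (non -100) token counts
losses = [2.0, 1.0]
tokens = [100, 300]

naive = sum(losses) / len(losses)                                    # 1.5
weighted = sum(l * t for l, t in zip(losses, tokens)) / sum(tokens)  # 1.25

# the weighted value equals the loss a single GPU would compute over all
# 400 tokens; the naive mean over-weights the padding-heavy rank
print(naive, weighted)
```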

Both Ulysses and Ring Attention use position_ids instead of attention_mask for causal masking during training. A 4D attention mask at these sequence lengths would be just as prohibitive as the attention scores themselves: at 128k tokens, that is another ~1TB tensor. Position IDs achieve the same causal behavior with O(n) memory instead of O(n^2).



Integration with Transformers Trainer

The Transformers Trainer provides seamless Ulysses integration through TrainingArguments.parallelism_config. It handles all of the SP-specific details automatically (dataloader wrapping, sequence sharding, and loss aggregation), so you don't need to write any of the custom loss code shown above.



Configuration

Just pass the same parallelism_config from above into TrainingArguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    parallelism_config=parallelism_config,  
    per_device_train_batch_size=1,
)



What the Trainer Handles Automatically

  1. Dataloader Wrapping: After model preparation, the Trainer wraps the dataloader with UlyssesSPDataLoaderAdapter

  2. Loss Computation: The compute_loss method detects SP mode and routes to specialized _deepspeed_sp_compute_loss which handles:

    • Gathering losses across SP ranks
    • Computing valid token counts per rank
    • Weighted loss aggregation
  3. Batch Size Calculation: The effective data parallel world size accounts for SP:

    dp_world_size = world_size // sp_size
    
  4. Dataloader Length Adjustment: Training step calculations are adjusted for SP’s effect on iteration count
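The batch-size accounting from point 3 works out as follows (values illustrative):

```python
world_size, sp_size = 8, 4
per_device_train_batch_size, gradient_accumulation_steps = 1, 8

# GPUs in one SP group jointly process ONE sequence, so they count as
# a single data-parallel replica
dp_world_size = world_size // sp_size  # 2 replicas

global_batch = dp_world_size * per_device_train_batch_size * gradient_accumulation_steps
print(global_batch)  # 16 samples per optimizer step
```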



Launch Command

Use an accelerate config file or command-line arguments:

accelerate launch \
    --config_file deepspeed_ulysses.yaml \
    train.py \
    --per_device_train_batch_size 1



Integration with TRL SFTTrainer

TRL’s SFTTrainer builds on the Transformers Trainer and adds specific optimizations for supervised fine-tuning with long sequences.



Configuration

from trl import SFTConfig, SFTTrainer
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    dp_shard_size=2,  # combine SP with 2-way data parallelism (2 x 2 = 4 GPUs)
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)

training_args = SFTConfig(
    ...,
    parallelism_config=parallelism_config,
    max_length=32768,
    pad_to_multiple_of=2,  # must equal sp_size
    per_device_train_batch_size=1,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()



Key SFTConfig Parameters for Ulysses

| Parameter | Description |
|---|---|
| pad_to_multiple_of | Must equal sp_size to ensure sequence divisibility |
| max_length | Global sequence length (before splitting across GPUs) |
| packing | Works well with SP; packing reduces padding waste, especially for variable-length sequences |



Accelerate Config File

Create alst_ulysses_4gpu.yaml:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
mixed_precision: bf16
num_processes: 4
deepspeed_config:
  zero_stage: 3
  seq_parallel_communication_data_type: bf16
parallelism_config:
  parallelism_config_sp_size: 2
  parallelism_config_sp_backend: deepspeed
  parallelism_config_dp_shard_size: 2
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: flash_attention_2



Complete Training Command

accelerate launch --config_file alst_ulysses_4gpu.yaml \
    trl/scripts/sft.py \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --dataset_name trl-lib/Capybara \
    --max_length 32768 \
    --packing \
    --pad_to_multiple_of 2 \
    --per_device_train_batch_size 1



Shift Labels Handling

The SFTTrainer automatically handles pre-shifted labels when Ulysses is enabled:



# fall back to standard labels only when pre-shifted labels are absent
labels = inputs["labels"] if "shift_labels" not in inputs else None

# with Ulysses, labels arrive pre-shifted, so logits are used as-is
if "shift_labels" in inputs:
    shift_logits = outputs.logits.contiguous()
    shift_labels = inputs["shift_labels"]
else:
    shift_logits = outputs.logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
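The two branches produce the same (logit, target) pairing. A quick sketch with plain lists standing in for tensors (values invented):

```python
labels = [10, 11, 12, 13, -100]     # last position has no next-token target

# standard path: drop last logit, drop first label
standard_targets = labels[1:]       # [11, 12, 13, -100]

# pre-shifted path (Ulysses): labels already moved left by one, padded with -100
shift_labels = labels[1:] + [-100]  # [11, 12, 13, -100, -100]

# at every supervised position, both paths pair logit i with token i+1
assert shift_labels[:len(standard_targets)] == standard_targets
```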



Comparing Ulysses and Ring Attention

Both Ulysses and Ring Attention enable long-context training, but they have different characteristics:

| Aspect | Ulysses (DeepSpeed) | Ring Attention (FSDP2) |
|---|---|---|
| Parallelism method | Attention head partitioning | Ring-based KV exchange |
| Backend | DeepSpeed ZeRO | PyTorch FSDP2 |
| Attention support | FlashAttention 2/3, SDPA | SDPA only |
| Communication | Two all-to-alls per layer | P2P ring communication |
| Comm volume per GPU | O(total_seq x hidden / sp_size) | O(total_seq x hidden) |
| Sequence divisibility | sp_size | cp_size * 2 |
| Num head constraint | num_heads >= sp_size | None |



When to Select Ulysses vs Ring Attention

Since switching between the two only requires changing the accelerate config, we recommend trying both and comparing performance and memory usage on your specific setup. The main constraint is that Ulysses requires num_heads >= sp_size, while Ring Attention has no such limitation.



Best Practices



1. Sequence Length Divisibility

Always ensure your sequence length is divisible by sp_size:

training_args = SFTConfig(
    pad_to_multiple_of=4,  # must equal sp_size
    max_length=32768,  # divisible by sp_size=4
)
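In miniature, the padding behaves like this hypothetical helper (not the TRL implementation):

```python
def pad_to_multiple(seq, multiple, pad_id=0):
    """Pad a token list so its length divides evenly by `multiple`."""
    rem = len(seq) % multiple
    return seq + [pad_id] * ((multiple - rem) % multiple)

padded = pad_to_multiple(list(range(10)), 4)
print(len(padded))  # 12: now splits evenly across sp_size=4 ranks
```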



2. Use Flash Attention

Flash Attention 2 provides cleaner output and better performance than SDPA:

parallelism_config = ParallelismConfig(
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_attn_implementation="flash_attention_2",
    ),
)

Use Flash Attention 3 on Hopper, and look out for the Flash Attention 4 release for Blackwell (FA2 on Blackwell is quite slow).



3. Combine with DeepSpeed ZeRO

For very large models, combine Ulysses with ZeRO Stage 3:

deepspeed_config:
  zero_stage: 3
  offload_optimizer:
    device: cpu

If the model is large, you can offload the parameters as well by adding to the above:

  offload_param:
    device: cpu



5. Use memory fragmentation-friendly PyTorch allocator

This environment variable allows for longer sequence lengths by reducing memory fragmentation:

export PYTORCH_ALLOC_CONF=expandable_segments:True



6. 2D Parallelism Configuration

Balance SP and DP for your GPU count:

| GPUs | sp_size | dp_shard_size | Use Case |
|---|---|---|---|
| 4 | 2 | 2 | Balanced throughput and sequence length |
| 4 | 4 | 1 | Maximum sequence length |
| 8 | 2 | 4 | Higher throughput with moderate sequence length |
| 8 | 4 | 2 | Longer sequences with moderate throughput |

Remember: dp_replicate_size × dp_shard_size × sp_size = num_processes
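A quick sanity check of that identity against the table above:

```python
def check_layout(num_processes, sp_size, dp_shard_size, dp_replicate_size=1):
    """True when the 2D parallel layout exactly covers all processes."""
    return dp_replicate_size * dp_shard_size * sp_size == num_processes

assert check_layout(4, sp_size=2, dp_shard_size=2)
assert check_layout(4, sp_size=4, dp_shard_size=1)
assert check_layout(8, sp_size=2, dp_shard_size=4)
assert check_layout(8, sp_size=4, dp_shard_size=2)
assert not check_layout(8, sp_size=4, dp_shard_size=4)  # 16 != 8: invalid layout
```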



7. Liger-Kernel

If your desired model architecture is supported by Liger-Kernel, it is fully compatible with Ulysses SP and can be enabled with a single flag:

training_args = SFTConfig(
    use_liger_kernel=True,
)

The main memory saving comes from FusedLinearCrossEntropy, which avoids materializing the full logits tensor during loss calculation. The savings grow with longer sequences, where the logits tensor is larger.

Additionally, you can enable TiledMLP to further extend sequence length; like FusedLinearCrossEntropy, it saves working memory by tiling large matrix operations.



8. Token Distribution Across Ranks

You don't need to worry about manually balancing tokens across SP ranks; the loss aggregation code handles uneven distributions gracefully (including ranks with zero valid tokens). With random batching over a reasonably sized dataset, the distribution evens out statistically over training.



Benchmarks

To quantify the benefits of Ulysses SP, we trained Qwen3-4B on the Gutenberg English streaming dataset using TRL's SFTTrainer. All experiments ran on H100 80GB GPUs with DeepSpeed ZeRO-3, CPU optimizer offloading, gradient checkpointing, and flash-attn2 as the attention backend.



Setup

| Config | GPUs | SP | DP | Seq Length | Grad Acc | Global Batch |
|---|---|---|---|---|---|---|
| Baseline | 1 | 1 | 1 | 8K | 8 | 8 |
| SP=4 | 4 | 4 | 1 | 8K | 8 | 8 |
| SP=4 | 4 | 4 | 1 | 32K | 8 | 8 |
| SP=4 | 4 | 4 | 1 | 64K | 8 | 8 |
| SP=4 | 4 | 4 | 1 | 96K | 8 | 8 |

The benchmark runs in the table above use the same global batch size (8 micro-batches), cosine learning-rate schedule, and seed, so their loss curves are directly comparable.



Loss Curve Matching Diagnostics (4 GPU)

To confirm SP-vs-DP loss equivalence, we ran controlled 4-GPU A/B experiments with an identical seed, model, optimizer, learning-rate schedule, and data order.



Methodology for Fair DP vs SP Comparison

Compared setups:

  • DP=4, SP=1, GAS=1 (baseline)
  • DP=1, SP=4, GAS=4 (Ulysses SP)

For fair comparison, GAS must scale with SP:

  • Ulysses SP splits the sequence across SP ranks, so each SP rank sees roughly 1/SP of the sequence tokens per micro-step.
  • If GAS is unchanged, each optimizer step in SP aggregates fewer total tokens than the DP baseline.
  • Setting GAS=SP keeps effective tokens per optimizer step matched:
    • DP tokens/step: dp_world_size * micro_batch * seq_len * GAS = 4 * B * L * 1
    • SP tokens/step: dp_world_size * micro_batch * (L/SP) * GAS * SP_ranks = 1 * B * (L/4) * 4 * 4 = 4 * B * L
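The token accounting above can be checked directly (B and L are illustrative placeholders for micro-batch size and sequence length):

```python
B, L, SP = 1, 8192, 4  # micro batch, seq length, SP degree

# DP baseline: 4 replicas, GAS=1, each replica sees the full sequence
dp_tokens_per_step = 4 * B * L * 1

# Ulysses: 1 replica, GAS=4, each of the 4 SP ranks sees L/SP tokens
sp_tokens_per_step = 1 * B * (L // SP) * 4 * SP

assert dp_tokens_per_step == sp_tokens_per_step == 4 * B * L
```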
Figure: canonical loss on Gutenberg for DP=4 vs SP=4. On Gutenberg text (20 steps), the canonical loss matches within logging precision between DP=4, SP=1, GAS=1 and DP=1, SP=4, GAS=4.

Measured over 20 steps on 4 GPUs in controlled equivalence harnesses:

| Harness | Metric | DP vs SP setting | Mean abs diff | Max abs diff |
|---|---|---|---|---|
| Trainer | loss | DP=4, SP=1 vs DP=1, SP=4 | 0.0054 | 0.0131 |
| SFTTrainer | logged loss | DP=4, SP=1 vs DP=1, SP=4 | 0.0811 | 0.0812 |
| SFTTrainer | canonical NLL | DP=4, SP=1 vs DP=1, SP=4 | 0.000004 | 0.000005 |

Takeaway: under a matched token budget, SP and non-SP match on the canonical token-normalized loss. The remaining difference is in trainer-reported logging (loss), not in the underlying cross-entropy objective.



Memory Reduction

Peak GPU Memory per Rank
SP=4 reduces per-GPU memory by 3.3x at the same sequence length, enabling training at up to 96K tokens on 4× H100 80GB. At 128K, the model OOMs.

| Config | Seq Length | Peak Memory | Notes |
|---|---|---|---|
| DP=4 (4 GPU) | 8K | 22.4 GB | Baseline (no SP) |
| SP=4 (4 GPU) | 8K | 22.8 GB | Similar memory at same seq length |
| SP=4 (4 GPU) | 32K | 35.0 GB | 4x longer than DP baseline |
| SP=4 (4 GPU) | 64K | 50.5 GB | 8x longer than DP baseline |
| SP=4 (4 GPU) | 96K | 66.0 GB | 12x longer than DP baseline |
| SP=4 (4 GPU) | 128K | OOM | Exceeds 80 GB limit |

At 8K tokens, DP=4 and SP=4 use nearly the same memory per GPU (~22 GB with ZeRO-3). The advantage of SP is that it enables scaling to much longer sequences: at 96K tokens (12x longer), peak memory is 66 GB, still within the H100's 80 GB capacity. At 128K, the model OOMs, establishing the practical limit for this configuration. DP=4 without SP cannot scale beyond 8K for this model.



Throughput

Training Throughput
Longer sequences with SP process dramatically more tokens per second. SP=4 at 64K achieves 3.7x the throughput of the baseline.

| Config | Seq Length | Tokens/s | vs Baseline |
|---|---|---|---|
| Baseline (1 GPU) | 8K | 3,633 | 1x |
| SP=4 (4 GPU) | 8K | 3,933 | ~1x |
| SP=4 (4 GPU) | 32K | 7,733 | 2.1x |
| SP=4 (4 GPU) | 64K | 13,396 | 3.7x |

At the same sequence length (8K), SP=4 has comparable throughput to the single-GPU baseline; the all-to-all communication overhead is minimal on NVLink-connected GPUs. The real benefit comes from longer sequences: as sequence length grows, the quadratic attention computation dominates over communication and other overheads, making each training step increasingly compute-efficient. Each step also processes proportionally more tokens, so throughput scales with sequence length. At 64K, SP=4 processes 13,396 tokens/second, 3.7x the baseline.

These results use only 4 GPUs with SP=4. With 8 GPUs (SP=8), you can push to even longer sequences (up to 256K+ tokens) or use 2D parallelism (SP=4, DP=2) to combine long-context training with data-parallel throughput.



Requirements

  • HF Accelerate: deepspeed>=0.18.1, accelerate>=1.12
  • HF Trainer: deepspeed>=0.18.1, accelerate>=1.12, transformers>=5.0
  • HF TRL: deepspeed>=0.18.1, accelerate>=1.12, transformers>=5.0, trl>=0.18.0

Use flash_attention_2 for Ampere GPUs, or flash_attention_3 for Hopper GPUs. Wait for flash_attention_4 on Blackwell 🕰.



