TRL supports training LLMs using GRPO, an online learning algorithm recently introduced in the DeepSeekMath paper. In GRPO, the model learns from its own outputs: it generates responses during training, receives feedback, and uses that feedback to improve itself over time.
This makes generation a critical step in the training loop, and also a significant bottleneck. To speed up generation, TRL integrates with vLLM. This combination lets you train powerful models more efficiently in a GRPO setup. However, there is a catch.
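To make the loop concrete, here is a minimal pseudocode-level sketch of one GRPO-style step; the names policy, reward_fn, and grpo_step are hypothetical illustrations, not TRL's API:

```python
# Illustrative pseudocode for one online GRPO step; the names policy,
# reward_fn, and grpo_step are hypothetical, not TRL internals.
def grpo_step(policy, prompts, reward_fn, group_size=8):
    # 1. The current policy generates a group of completions per prompt
    completions = [policy.generate(p, n=group_size) for p in prompts]
    # 2. A reward function scores every completion
    rewards = [[reward_fn(c) for c in group] for group in completions]
    # 3. Update the policy to favor completions that beat their
    #    group's average reward (the "group-relative" part of GRPO)
    policy.update(prompts, completions, rewards)
```

Because step 1 runs on every training step, any time spent waiting on generation is time the training GPUs spend idle.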
🧨 The Problem
Before TRL v0.18.0, vLLM was only supported in server mode, running as a separate process on different GPUs from the training job. It communicated with the training script over HTTP, which made the setup modular and easy to use, but also introduced GPU inefficiencies.
Here’s what happens:
- During training, the model must generate completions frequently.
- The trainer sends a request to the vLLM server, which runs on its own GPUs.
- While vLLM generates, the training GPUs sit idle and wait.
- Once generation is complete, the vLLM GPUs become idle and training resumes.
This “ping-pong” between training and generation causes:
- Wasted GPU time on both sides
- Extra GPUs provisioned just to run inference
- Reduced overall throughput and higher cost
In online learning methods like GRPO, where generation happens continuously, this inefficiency becomes even more painful. You spend more on hardware but don't get the performance you'd expect.
So the key question becomes: can we share the same GPUs for both training and generation, instead of separating them?
💡 The Opportunity
The fundamental issue was that training and inference ran on separate GPUs, resulting in idle time and underutilization. The natural solution? Run both on the same GPUs. Instead of having vLLM operate as a standalone server in its own process and devices, what if vLLM could run alongside the training code, within the same distributed process group? This would let us launch a single distributed job where training and inference share the same devices, switching between tasks efficiently without wasting resources.
This approach is what we refer to as colocation: training and inference are co-located on the same GPUs and coordinated via the same process group, allowing them to take turns cleanly, with no extra hardware needed.
Previously, this wasn't possible in TRL, which relied on vLLM as an external HTTP server. That changed with our PR #3394, which added support for vLLM's external launcher and true integration into the training process.
What It Enables
- Unified Execution: By embedding vLLM in the same process group, both training and inference tasks can share the same GPUs, taking turns instead of waiting on each other. This reduces idle time and boosts overall efficiency.
- Skip HTTP Communication: No need for REST API calls or networking; vLLM runs inline with the training loop, avoiding overhead and latency.
- Torchrun Compatibility: Works seamlessly with torchrun, so it is easy to scale across nodes with minimal config changes.
- TP and DP Support: Compatible with Tensor Parallelism and Data Parallelism, making it suitable for large-scale training runs.
- SPMD Execution Pattern: Uses a Single Program, Multiple Data (SPMD) model, where each GPU runs its own instance of the engine in sync. Ideal for distributed multi-GPU, multi-node setups (see the sketch after this list).
- Simplified Deployment: You no longer need to maintain a separate server script; vLLM is launched and controlled directly inside your training job.
- Enhanced Throughput: By avoiding idle GPUs and eliminating inter-process communication, the system delivers faster training and generation, which is especially important in online learning setups like GRPO.
- Robust Inter-process Communication: This is more robust because it avoids the complexity of setting up distributed process groups between independent processes, as required in server mode.
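As a rough illustration of the SPMD pattern (a conceptual sketch, not TRL code): torchrun launches one identical copy of the script per GPU, and every rank joins the shared process group before building its own engine instance.

```python
# Conceptual SPMD sketch (not TRL code), launched as, e.g.:
#   torchrun --nproc-per-node=8 train.py
# torchrun starts one identical copy of this script per GPU; every rank
# runs the same program and would then build its own vLLM engine.
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
print(f"rank {rank}/{world}: same program, own engine instance")
```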
Thanks to this feature, co-located training and inference is no longer a hack: it is now first-class, scalable, and production-ready.
🧩 Design: From Separate Servers to Shared GPUs
The shift from server TRL to co-located TRL is all about smarter GPU usage. The diagram below shows the difference:
Server TRL Setup (Top Row)
In the server TRL setup, training and inference run on separate GPUs. For example:
- GPUs 0 through 2 are used for training.
- GPU 3 is fully dedicated to running vLLM as a separate server.
During training steps, GPU 3 sits idle.
During generation steps (inference), GPUs 0–2 are idle while GPU 3 generates outputs.
This results in:
- Inefficient GPU usage, with devices frequently waiting on each other
- Extra GPUs provisioned solely for inference
- Increased cost and complexity
Co-located TRL Setup (Bottom Row)
In contrast, the co-located TRL setup runs both training and vLLM on the same GPUs. Each GPU:
- Runs the training loop
- Launches a vLLM engine within the same process
Training and inference take turns using the GPU’s resources — no need for dedicated devices or separate processes.
This design:
- Reduces idle time
- Minimizes inter-process and HTTP communication
- Fully utilizes available GPU memory and compute
- Delivers faster throughput without increasing hardware requirements
🛠️ Implementation Notes
Instead of launching vLLM as a server, the trainer now launches vLLM in-process using the external launcher, as shown below:
```python
self.llm = LLM(
    model=model.name_or_path,
    tensor_parallel_size=args.vllm_tensor_parallel_size,
    gpu_memory_utilization=self.vllm_gpu_memory_utilization,
    max_num_seqs=self.args.per_device_train_batch_size
    * self.vllm_tensor_parallel_size
    * self.args.gradient_accumulation_steps,
    max_model_len=self.max_prompt_length + self.max_completion_length,
    distributed_executor_backend="external_launcher",
    # Ranks in the same TP group share a seed; different groups sample differently
    seed=self.accelerator.process_index // self.vllm_tensor_parallel_size,
)
```
Co-located vLLM respects the torch.distributed process group and rank structure. This allows vLLM to be initialized alongside training without conflict and makes TP/DP setups work seamlessly:
```python
if self.vllm_tensor_parallel_size > 1:
    # Group ranks into consecutive TP groups, e.g. with TP=2 and 8 processes:
    # [0, 1], [2, 3], [4, 5], [6, 7]
    self.tp_group, _ = torch.distributed.new_subgroups_by_enumeration(
        [
            list(range(i * self.vllm_tensor_parallel_size, (i + 1) * self.vllm_tensor_parallel_size))
            for i in range(self.accelerator.num_processes // self.vllm_tensor_parallel_size)
        ]
    )
```
Co-located vLLM no longer relies on REST APIs; it runs directly in memory and communicates via native Python calls:
```python
if self.vllm_tensor_parallel_size > 1:
    # Gather prompts from every rank in the TP group so each engine shard
    # sees the full batch
    orig_size = len(prompts_text)
    gathered_prompts = [None for _ in range(self.vllm_tensor_parallel_size)]
    torch.distributed.all_gather_object(gathered_prompts, prompts_text, group=self.tp_group)
    all_prompts_text = [p for sublist in gathered_prompts for p in sublist]
else:
    all_prompts_text = prompts_text

with profiling_context(self, "vLLM.generate"):
    all_outputs = self.llm.generate(all_prompts_text, sampling_params=sampling_params, use_tqdm=False)

completion_ids = [output.token_ids for outputs in all_outputs for output in outputs.outputs]

if self.vllm_tensor_parallel_size > 1:
    # Each rank keeps only the completions for its own prompts
    local_rank_in_group = torch.distributed.get_rank(group=self.tp_group)
    tp_slice = slice(local_rank_in_group * orig_size, (local_rank_in_group + 1) * orig_size)
    completion_ids = completion_ids[tp_slice]
```
To use this setup, simply set vllm_mode="colocate" in your GRPO configuration:
```python
training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_mode="colocate",
)
```
Note: Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the vllm_gpu_memory_utilization parameter in GRPOConfig to avoid underutilization or out-of-memory errors.
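As a rough, back-of-envelope illustration of how you might pick a value (the numbers below are made up; profile your own job):

```python
# Back-of-envelope sizing for vllm_gpu_memory_utilization (hypothetical numbers).
# In colocation, training and vLLM share one device, so vLLM's slice must
# leave room for weights, gradients, optimizer state, and activations.
total_gb = 80.0           # e.g. one 80 GB GPU
training_needs_gb = 55.0  # estimated training footprint on this device
vllm_utilization = (total_gb - training_needs_gb) / total_gb
print(f"vllm_gpu_memory_utilization ≈ {vllm_utilization:.2f}")  # ≈ 0.31
```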
📊 Showcase: Co-located vs. Plain TRL Performance
To measure the impact of colocation, we ran a series of experiments comparing the traditional server mode (where vLLM runs on a separate GPU as a standalone server) with the new co-locate mode (where training and inference share the same GPUs).
In server mode, only 7 GPUs are used for training because 1 GPU is fully dedicated to the vLLM inference server.
In co-locate mode, all 8 GPUs are used for training — increasing the effective batch size by default.
To ensure a fair comparison, we normalized throughput in server mode by a factor of 8/7. This adjustment accounts for the greater training capacity in co-locate mode and allows us to compare the two setups under equal training conditions, as shown in the snippet below.
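Concretely, the normalization is just a rescaling (the throughput value here is illustrative):

```python
# Server mode trains on 7 of 8 GPUs; colocate trains on all 8, so server-mode
# throughput is scaled by 8/7 before comparison.
server_throughput = 70.0                # hypothetical samples/s on 7 training GPUs
normalized = server_throughput * 8 / 7  # what 8 training GPUs would deliver
print(f"normalized server throughput: {normalized:.1f} samples/s")  # 80.0
```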
Experiment 1: 1.5B Model — Various Batch Sizes
- As the batch size increases, throughput improves in both setups.
- The co-located setup reaches up to a 1.43× speedup at the largest batch size.
- Larger batches make better use of shared GPU memory in co-located mode.
(Figure: throughput vs. batch size, 1.5B model)
Experiment 2: 1.5B Model — Various Tensor Parallelism (TP)
- In the co-located setup, increasing TP reduces performance.
- More sharding introduces more communication overhead, which is not ideal for smaller models.
- Takeaway: for small models, avoid over-sharding in co-located mode.
(Figure: throughput vs. tensor parallelism, 1.5B model)
Experiment 3: 7B Model — Various Batch Sizes
- Again, co-located mode scales better with batch size.
- Gains reach a 1.35× speedup at the largest batch size tested.
(Figure: throughput vs. batch size, 7B model)
Experiment 4: 7B Model — Various Tensor Parallelism (TP)
- The opposite trend from the 1.5B model.
- With 7B, more TP improves throughput, reaching up to a 1.73× speedup.
- Larger models benefit from sharding in co-located setups.
(Figure: throughput vs. tensor parallelism, 7B model)
📊 Scaling to 72B Model
When training large models like Qwen2.5-Math-72B, it is important to use the right strategies to make training efficient, scalable, and stable across many GPUs and nodes. In our setup, we combined co-located vLLM with several key optimizations to make this work efficiently.
Sleep Mode in vLLM
When using co-located training, managing GPU memory is crucial so that both training and inference can run smoothly on the same devices. To support this, we added vLLM's sleep() API into the GRPO training loop.
The sleep() function temporarily pauses the vLLM engine and frees up GPU memory. It supports two levels:
- Level 1: Unloads model weights from the GPU (keeping them in CPU memory) and clears the KV cache. Useful when the same model will be reused soon.
- Level 2: Unloads both model weights and the KV cache entirely. Best for scenarios where the model will change or won't be reused immediately.
In GRPO, the model is updated after every step, so we use Level 2 sleep, as sketched below.
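Here is a simplified sketch of one colocated step with Level 2 sleep, assuming the engine was created with enable_sleep_mode=True; compute_grpo_loss and optimizer are hypothetical stand-ins for the real trainer logic:

```python
# Simplified sketch of one colocated GRPO step with level 2 sleep
# (assumes llm = LLM(..., enable_sleep_mode=True); compute_grpo_loss and
# optimizer are hypothetical stand-ins, not TRL code).
def colocated_step(llm, prompts, sampling_params, compute_grpo_loss, optimizer):
    llm.wake_up()                      # restore the engine before generating
    outputs = llm.generate(prompts, sampling_params=sampling_params)
    llm.sleep(level=2)                 # drop weights AND KV cache: the policy
                                       # changes every step, so neither is reusable
    loss = compute_grpo_loss(outputs)  # training now has (almost) the whole GPU
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```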
Benefits of Level 2 sleep:
- Maximizes free GPU memory for training
- Avoids memory contention between training and generation
- Keeps colocation efficient, even for large models like Qwen2.5-72B
This small addition makes a big difference in enabling smooth and scalable co-located training.
DeepSpeed Optimizations
To train large models like Qwen2.5-72B, we rely on DeepSpeed ZeRO Stage 3, the same setup used in plain TRL.
ZeRO helps scale large models by distributing memory across GPUs. Stage 3 goes further by partitioning:
- Model weights
- Gradients
- Optimizer states
This is essential for models that can't fit on a single GPU. With ZeRO Stage 3, each GPU handles only a portion of the model.
Additional options we enable (shown in the example config below):
- "offload_optimizer": {"device": "cpu"} moves optimizer states to CPU to free GPU memory, which is critical in co-located setups.
- "overlap_comm": true overlaps communication with computation, speeding up training.
- "contiguous_gradients": true allocates gradients in a single memory block, improving memory access and reducing fragmentation.
These optimizations help train 72B models efficiently and keep colocation stable under tight memory constraints.
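Putting those options together, a ZeRO Stage 3 config might look like this sketch (a trimmed, illustrative Python dict, not our exact file):

```python
# Illustrative DeepSpeed ZeRO-3 config as a Python dict (values are examples,
# not the exact configuration used in our runs).
ds_config = {
    "zero_optimization": {
        "stage": 3,                              # partition weights, grads, optimizer states
        "offload_optimizer": {"device": "cpu"},  # free GPU memory for colocation
        "overlap_comm": True,                    # overlap communication with compute
        "contiguous_gradients": True,            # reduce gradient memory fragmentation
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
}
```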
Accelerate Integration
As recommended in TRL, we use Accelerate, a lightweight library that simplifies distributed training. It handles:
- Multi-GPU and multi-node job launching
- Data parallelism
- Gradient accumulation
- Distributed data loading
This makes the setup clean, scalable, and easy to maintain.
Experiment 5: Qwen2.5-Math-72B — Throughput, Accuracy, and Benchmark Results
Throughput
Even with 4 fewer GPUs, the co-locate setup is ~1.26× faster than plain TRL.
This highlights the effectiveness of smarter GPU sharing and memory cleanup using sleep().
(Figure: Qwen2.5-Math-72B throughput comparison)
Reward Curve
Training reward plots for the co-locate and plain setups are nearly identical, demonstrating that colocation does not change the training dynamics.
Math500 Benchmark
We evaluated three models on the Math500 benchmark: the base model, the co-locate-trained model, and the plain-trained model. Both trained models outperform the base, and the co-locate model performs on par with the plain-trained model, confirming that colocation doesn't compromise downstream performance.
(Figure: Math500 benchmark results)
🎓 Challenges, Lessons Learned & Next Steps
Through our work on scaling GRPO training with co-located vLLM, we faced several critical challenges and learned important lessons about efficiency, flexibility, and system design when training large models.
Challenges
- Tensor Parallelism Bug in vLLM ≥ 0.8.0. Tensor Parallelism (TP) with external_launcher stopped working in vLLM version 0.8.0 and above. This was tracked under Issue #15895. To find the breaking point, we followed the approach described in this vLLM developer blog post, which provides wheels for each commit. After bisecting, we identified the breaking commit as cc10281. The root cause was determinism: the newer versions required explicitly setting the random seed. Once the seed was set, the issue went away.
- Level 2 Sleep Buffer Bug. Initially, level 2 sleep didn't work correctly when we tried to reload weights using load_weights. This issue was tracked in Issue #16564. The problem was that model buffers (like running mean/var in BatchNorm) weren't restored after waking up from sleep. The fix came with PR #16889, which added logic to explicitly restore buffers when waking up from level 2 sleep. We now keep a copy of the original buffers and manually reapply them after loading new weights (see the sketch after this list).
- Segmentation Fault on Exit. There is still an open issue with vLLM sleep causing a segmentation fault at the end of training when closing processes. This was reported in Issue #16993. The crash happens during shutdown but doesn't break training itself, so we were able to complete all demos and experiments shared in this blog. However, we are waiting for an official fix before integrating sleep() fully into TRL upstream.
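For the buffer issue, the workaround looks roughly like the following (a simplified sketch assuming model is a torch.nn.Module; buffer_backup is our own name, not a vLLM API):

```python
# Simplified sketch of the buffer workaround for level 2 sleep.
# Snapshot module buffers (e.g. BatchNorm running stats) once, before sleeping...
buffer_backup = {name: buf.detach().clone() for name, buf in model.named_buffers()}

# ...then, after llm.wake_up() and reloading weights, reapply them:
for name, buf in model.named_buffers():
    buf.data.copy_(buffer_backup[name])
```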
These challenges weren’t blockers, but they required careful debugging, version control, and a deeper understanding of how vLLM manages memory and parallelism under the hood.
Lessons Learned
- Co-located inference dramatically improves GPU utilization. By allowing training and generation to share the same GPUs, we eliminate idle time and reduce hardware requirements, achieving higher throughput even with fewer GPUs.
- vLLM's sleep() feature is essential for large-scale colocation. It enables fine-grained control over memory usage, allowing training to fully reclaim GPU memory between generation steps, a key enabler for models like Qwen2.5-72B.
- DeepSpeed ZeRO Stage 3 is essential for training large models. It allows extremely large networks to fit into memory by distributing model weights, gradients, and optimizer states across multiple GPUs. In our experience, enabling contiguous_gradients helped reduce memory fragmentation, while offloading the optimizer to the CPU freed up critical GPU memory, both of which were especially helpful in co-located setups.
- Colocation is powerful but comes with trade-offs. It works best when GPU memory is carefully managed, often requiring manual tuning of memory usage parameters like vllm_gpu_memory_utilization. While it offers clear throughput advantages and reduces idle GPU time, colocation may not be ideal for models with tight memory budgets or when memory fragmentation is not well controlled. When done right, though, it unlocks significant efficiency gains.
- TP/DP compatibility, Accelerate, and torchrun support make deployment seamless. Despite the complexity of the underlying architecture, the entire system can be launched and scaled with standard distributed tools.
- Co-located training maintains model quality. Across multiple benchmarks (Math500, AIME24), co-located and plain setups produced comparable results, validating that performance isn't sacrificed for efficiency.
✅ Conclusion
This blog post explored how co-locating vLLM with GRPO training unlocks significant efficiency gains when training large language models — including models as large as Qwen2.5-72B.
Traditionally, TRL only supported vLLM in server mode, which required separate processes and GPUs for inference, resulting in wasted compute and idle time. With the introduction of vLLM's external launcher and the colocation PR in TRL PR #3394, we can now run training and inference within the same distributed process group, on the same GPUs, with full support for TP, DP, and Accelerate.
While challenges remain, such as version-specific vLLM bugs and edge cases with sleep(), the overall results show that co-located GRPO is a practical, scalable solution for training large models efficiently. We're excited to continue refining this setup, integrating features like FSDP, and pushing the boundaries of large model training, making it faster, cheaper, and more accessible for everyone building the next generation of LLMs.
✅ Give It a Try!
Below is an example to try out GRPO training with co-located vLLM.
📄 train_grpo_colocate.py
```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Reward: completions closer to 20 characters score higher
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    logging_steps=1,
    use_vllm=True,
    vllm_mode="colocate",
    vllm_tensor_parallel_size=1,
    vllm_gpu_memory_utilization=0.3,
    max_prompt_length=512,
    max_completion_length=1024,
    max_steps=2,
    num_generations=4,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    push_to_hub=False,
    report_to=None,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()
```