Lessons from 16 Open-Source RL Libraries



TL;DR — For those of you who haven’t got time to read 5,000 words about async RL plumbing (we get it, you may have models to train):

  • The issue: In synchronous RL (reinforcement learning) training, data generation (model inference to create data samples) dominates wall-clock time — a single batch of 32K-token rollouts on a 32B (32-billion parameter) model can take hours, while the GPUs used for training remain idle.
  • The answer everyone converged on: Disaggregate (separate) inference and training onto different GPU pools, connect them with a rollout buffer (temporary storage for model outputs), and transfer weights asynchronously (without waiting), so neither side waits for the other.
  • We surveyed 16 open-source libraries that implement this pattern and compared them across 7 axes: orchestration primitives, buffer design, weight sync protocols, staleness management, partial rollout handling, LoRA support, and distributed training backends.
  • Key findings: Ray dominates orchestration (8/16 surveyed libraries). NCCL (NVIDIA Collective Communications Library) broadcast is the default method for transferring model weights. Staleness management, how outdated data samples are handled, ranges from simply dropping old samples to advanced importance-sampling correction. LoRA (Low-Rank Adaptation) training is sparsely supported. Distributed MoE (Mixture of Experts) support is the emerging differentiator.

If you’d rather skip straight to the good part, here’s the full comparison table (no reading required, we won’t judge).

But seriously, if you stick around, you might learn a thing or two about why your GPUs are idle 60% of the time.





1. Motivation: From synchronous RL training to async architectures

Async RL training has emerged as the dominant paradigm for post-training at scale. Several trends in modern post-training have made synchronous training loops nearly impossible to scale:

  • Long rollouts from reasoning models. Chain-of-thought training produces very long rollouts, and a single synchronous generation batch can take hours to finish on a single GPU. During all of that time, training GPUs sit completely idle.
  • Value-function-free trainers like GRPO use group-relative advantages. This means generating up to G times more rollouts per prompt, and the entire batch is gated by the slowest completion in the group.
  • The rise of agentic RL training. When models interact with tools, sandboxes, and external environments across multi-turn trajectories, rollout lengths and latencies become highly variable. A straightforward API call might return in seconds, while a complex reasoning chain with tool use can run for minutes or hours. MiniMax’s Forge framework, used to train MiniMax-M2.5, illustrates the scale this reaches in practice: context lengths up to 200K tokens, over 100 thousand distinct agent scaffolds and environments, and daily throughput on the order of hundreds of thousands of samples. At this scale, any synchronous barrier between generation and training becomes a severe bottleneck. The straggler problem alone (where a handful of slow rollouts block an entire batch) can idle hundreds of GPUs.

The open-source ecosystem has converged on a common architectural response: disaggregate inference from training onto separate GPU pools, connect them with a rollout buffer, and let both sides run concurrently.

We’re developing a new async trainer for TRL, one of the most widely used libraries for model post-training. To guide our design, we surveyed sixteen open-source libraries that were built from the ground up around asynchronous training and compared them across seven axes: orchestration primitives, buffer design, weight sync protocols, staleness management, partial rollout handling, LoRA support, and distributed training backends. This article distills the design principles we extracted from that survey.

Beyond RL, the need for async infrastructure is increasingly evident. For instance, on-policy distillation, where a student generates sequences and a teacher scores them, mirrors GRPO but swaps the reward function for a teacher forward pass. Because of this structural similarity, everything in this survey applies equally to async distillation. We’ll return to this broader point in Section 5.



1.1 How TRL Does RL Training Today

TRL’s current GRPOTrainer implements the full GRPO loop (prompt sampling, generation, reward scoring, advantage computation, gradient update, and weight sync) in a single synchronous training_step() call. This design is simple and correct, but it cannot overlap generation with training, leaving significant GPU utilisation on the table.

In the GRPOTrainer, the following phases run sequentially within each training step:

  1. Prompt sampling: draw a batch of prompts from the dataset. Nothing crazy here, let’s move on.
  2. Generation: call model.generate() (or forward requests to a vLLM server) to produce G completions per prompt. This is autoregressive and dominates wall-clock time.
  3. Reward scoring: evaluate each completion against one or more reward functions.
  4. Advantage computation: normalise rewards within each group to produce group-relative advantages.
  5. Forward and backward passes: compute the clipped policy gradient loss and backpropagate.
  6. Optimizer step: update model weights.
  7. Weight sync: push updated weights to the inference engine (vLLM) so the next generation uses the new policy.
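As a toy, runnable sketch of this control flow (stand-in functions, not the real TRL internals; the reward here is just random noise):

```python
import random

random.seed(0)

# Toy stand-ins for the real phases (hypothetical names, not the TRL API).
def generate(prompt, n):                  # phase 2: G completions per prompt
    return [f"{prompt}/cot-{i}" for i in range(n)]

def score(completion):                    # phase 3: reward scoring
    return random.random()

def group_relative(rewards):              # phase 4: GRPO advantage = reward - group mean
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def training_step(prompts, G=8):
    rollouts, advantages = [], []
    for prompt in prompts:                # each phase blocks before the next begins
        completions = generate(prompt, G) # dominates wall-clock time in practice
        rewards = [score(c) for c in completions]
        advantages += group_relative(rewards)
        rollouts += completions
    # phases 5-7 (loss, optimizer step, weight sync to vLLM) would follow here
    return rollouts, advantages

rollouts, advs = training_step(["p0", "p1"], G=4)
print(len(rollouts), round(abs(sum(advs)), 6))  # → 8 0.0
```

Note that each group’s advantages sum to zero by construction, which is why GRPO needs no value function.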

Each phase blocks until completion before the next begins. The timeline looks like this:

Synchronous TRL training timeline

TRL offers the steps_per_generation config option to reuse a single set of rollouts across multiple gradient steps (temporal reuse), amortizing the generation cost. But the generation call itself remains fully synchronous and blocking; the trainer cannot begin gradient computation until every completion in the batch has finished.

The library also supports running vLLM in server mode as a separate process. This frees the training GPU during generation, but two hard synchronisation barriers remain: the HTTP call blocks until all completions return, and the weight sync blocks both the trainer and vLLM during the transfer.



1.2 Colocated vs. Disaggregated Training

Before discussing async training, it’s important to understand the two deployment topologies for RL training with a separate inference engine:

  • Colocated mode places inference and training on the same set of GPUs. A single GPU (or TP group) holds both the training model (under FSDP or ZeRO) and the inference engine (vLLM or SGLang). Only one role is active at a time: during generation, the training model’s parameters may be offloaded or resharded into an inference-friendly layout (e.g., from FSDP shards to vLLM’s tensor-parallel layout); during training, the inference engine is paused or put to sleep. Weight “sync” is essentially free; it’s at most an in-place resharding on the same GPU, not a network transfer. The advantage of colocated mode is simplicity and cost; you need fewer total GPUs. The fundamental limitation is that inference and training cannot overlap. For example, here is TRL with vLLM in colocate mode:
TRL with vLLM in colocate mode
  • Disaggregated mode places inference and training on separate GPU pools. The inference pool runs vLLM or SGLang continuously; the training pool runs the optimizer continuously. The two pools communicate via a weight synchronisation protocol (NCCL broadcast, filesystem checkpoint, HTTP, etc.) and a data transfer mechanism (Ray object store, Redis streams, shared memory, etc.)

The main advantage of disaggregated mode is that inference and training can run concurrently. While the trainer computes gradients on batch N, the inference pool is already generating rollouts for batch N+K, enabling async training. However, this benefit comes at a cost: additional GPUs are required.

Concurrency, asynchronicity, and parallelism are distinct concepts that often get conflated. In this article, when we say “async training,” we mean something specific: generation and training running in parallel, with effective overlap; the inference pool is producing the next batch of rollouts while the training pool is computing gradients on the current batch. This is fundamentally a disaggregated-mode capability. Colocated mode can benefit from optimisations like sleep/wake memory management or fast in-place resharding to speed up inference, but it cannot achieve true simultaneous overlap; inference and training still take turns on the same GPUs. Every library in this survey that implements meaningful async overlap uses disaggregated mode as the foundation.



1.3 The Generation Bottleneck

In RL training for reasoning models, autoregressive generation dominates wall-clock time. A single rollout for a math or coding task can produce 8K–64K tokens of chain-of-thought reasoning (see QED-Nano rollout lengths).

To ground this concretely, consider vLLM benchmarks on a single H100 80GB GPU (bf16, no quantisation, offline throughput mode). A 7B model (DeepSeek-R1-Distill-Qwen-7B) achieves ~6,300 output tokens/s aggregate throughput; a 32B model (DeepSeek-R1-Distill-Qwen-32B) drops to ~1,200 output tokens/s. These figures are total throughput across all concurrent requests, the number the inference engine can push through per second regardless of how many sequences share the GPU.

Now consider a typical GRPO training step: G=8 completions per prompt × 64 prompts/batch = 512 rollouts. How long does generation take?

| Output length per rollout | Total output tokens (512 rollouts) | Time on 1×H100 (7B @ ~6K tok/s) | Time on 1×H100 (32B @ ~1.2K tok/s) |
|---|---|---|---|
| 2K tokens (short CoT) | ~1M tokens | ~3 min | ~14 min |
| 8K tokens (medium CoT) | ~4M tokens | ~11 min | ~56 min |
| 32K tokens (long CoT) | ~16M tokens | ~45 min | ~3.7 hours |
Even on the short end (2K tokens generated with a 7B model), generation alone consumes several minutes per training step. On the long end, where frontier reasoning models increasingly operate, a single generation phase can take hours on one GPU. Scaling to eight inference GPUs divides these times by roughly 8× (assuming linear throughput scaling), but even then, 32K-token rollouts on a 32B model still take ~28 minutes per step.
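A quick back-of-envelope check of these numbers (the throughputs are the single-H100 figures quoted above; linear scaling across inference GPUs is assumed):

```python
# Generation time for a batch of rollouts, given aggregate decode throughput.
def gen_time_hours(n_rollouts, tokens_per_rollout, tok_per_s, n_gpus=1):
    total_tokens = n_rollouts * tokens_per_rollout
    return total_tokens / (tok_per_s * n_gpus) / 3600

# 512 rollouts x 32K output tokens on a 32B model at ~1.2K tok/s per H100:
print(f"1 GPU: {gen_time_hours(512, 32_000, 1_200):.1f} h")            # → 1 GPU: 3.8 h
print(f"8 GPUs: {gen_time_hours(512, 32_000, 1_200, 8) * 60:.0f} min") # → 8 GPUs: 28 min
```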

The straggler problem compounds this further. In group-based algorithms like GRPO, you sample G completions per prompt. The batch cannot proceed until the slowest completion finishes. Chain-of-thought output lengths are highly variable; a single prompt might produce completions ranging from 1K to 32K tokens. The batch is gated by the longest completion, and continuous batching only partially mitigates this: shorter sequences free up slots for new work, but the last sequence in a GRPO group still blocks the group’s reward computation and training step.
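A tiny simulation of the straggler effect, under the simplifying assumption that decode time is proportional to output length:

```python
import random

random.seed(0)

# One GRPO group: G completions whose lengths vary widely, as in long-CoT training.
G = 8
lengths = [random.randint(1_000, 32_000) for _ in range(G)]

mean_len, straggler = sum(lengths) / G, max(lengths)
# The group's reward computation is gated by its longest completion.
print(f"mean length {mean_len:.0f}, straggler {straggler}; "
      f"the group waits ~{straggler / mean_len:.1f}x the average sequence time")
```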



1.4 The Core Insight

Every library in this survey has independently converged on the same architectural principle: physically separate inference GPUs from training GPUs, and push weights asynchronously, so generation never stops and training never waits.

The inference pool runs continuously, feeding completed rollouts into a buffer. The training pool pulls from the buffer, computes gradient updates, and periodically pushes new weights back to the inference pool to keep it in sync. The two loops run at their own pace, decoupled by the buffer.
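In miniature, the pattern is a bounded producer/consumer pipeline (a single-process toy with threads, not any particular library’s API):

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=4)   # bounded depth caps how much data is in flight
policy_version = 0

def inference_loop(n_rollouts):
    for i in range(n_rollouts):
        time.sleep(0.005)                                      # simulate generation
        buffer.put({"rollout": i, "version": policy_version})  # blocks when full

def training_loop(n_steps, batch_size=2):
    global policy_version
    for _ in range(n_steps):
        batch = [buffer.get() for _ in range(batch_size)]      # pull from the buffer
        policy_version += 1    # gradient step, then (async) weight push would go here

producer = threading.Thread(target=inference_loop, args=(8,))
consumer = threading.Thread(target=training_loop, args=(4,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print("final policy version:", policy_version)  # → final policy version: 4
```

The bounded queue is what caps staleness: the producer blocks once it gets too far ahead of the trainer.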

This setup is highly scalable, but it introduces a new class of problems: staleness (rollouts generated under an old policy), weight synchronisation overhead, partial rollout handling, etc. The rest of this article dissects in detail how current open-source libraries address these issues.




2. Libraries Surveyed

| Library | Organisation | Repo | GitHub ⭐ (Mar. ’26) |
|---|---|---|---|
| AReaL | inclusionAI/Ant Group | github.com/inclusionAI/AReaL | 4,338 |
| ART | CoreWeave | github.com/OpenPipe/ART | 8,952 |
| Atropos | NousResearch | github.com/NousResearch/atropos | 878 |
| MILES | radixark | github.com/radixark/miles | 950 |
| NeMo-RL | NVIDIA | github.com/NVIDIA-NeMo/RL | 1,383 |
| OAT | SAIL-SG | github.com/sail-sg/oat | 637 |
| open-instruct | AI2 (AllenAI) | github.com/allenai/open-instruct | 3,611 |
| PipelineRL | ServiceNow | github.com/ServiceNow/PipelineRL | 374 |
| PRIME-RL | PrimeIntellect | github.com/PrimeIntellect-ai/prime-rl | 1,114 |
| ROLL | Alibaba | github.com/alibaba/ROLL | 2,921 |
| SkyRL | NovaSky-AI | github.com/NovaSky-AI/SkyRL | 1,664 |
| SLIME | THUDM | github.com/THUDM/slime | 4,595 |
| TorchForge | Meta | github.com/meta-pytorch/torchforge | 632 |
| Tunix | Google | github.com/google/tunix | 2,175 |
| verl | ByteDance | github.com/verl-project/verl | 19,673 |
| verifiers-rl | PrimeIntellect | github.com/PrimeIntellect-ai/verifiers | 3,876 |



3. The Comparison Framework: Seven Axes

To make sense of the rapidly expanding ecosystem of async RL libraries, we propose seven orthogonal axes of comparison. Each axis captures a fundamental design decision that shapes the library’s performance, complexity, and trade-offs.

  • Axis 1 – Orchestration & Concurrency Primitive: how distributed components are coordinated (Ray actors, asyncio, pub/sub, HTTP).
  • Axis 2 – Rollout Buffer Design: how rollouts flow from inference to training.
  • Axis 3 – Weight Synchronisation Protocol: how updated weights reach inference servers, and whether the system must pause to accept them or can continue generating.
  • Axis 4 – Staleness Management: how off-policy rollouts are handled: version rejection, depth bounding, or importance-sampling correction.
  • Axis 5 – Partial Rollout Handling: what happens to in-flight generations when a weight update arrives mid-sequence.
  • Axis 6 – LoRA Training Support: general LoRA support and whether adapter-only parameters can be trained and synced, enabling sub-millisecond weight transfers.
  • Axis 7 – Distributed Training Backend & Parallelism: what parallelism strategy is used for training, constraining max model size.



Axis 1: Orchestration & Concurrency Primitive

How does the system coordinate its distributed components?

The choice of orchestration framework determines the programming model, failure semantics, and scalability ceiling. Rather than listing per-library implementation details, the landscape decomposes cleanly into four orchestration types, fundamental coordination paradigms that differ in abstraction level, failure model, and deployment requirements:

| Orchestration Type | What It Is | Libraries | Trade-offs |
|---|---|---|---|
| Distributed Actor Model | Components are actors, isolated stateful processes with mailboxes, managed by a runtime that handles scheduling, resource placement, fault tolerance, and object transfer. Communication is via asynchronous RPC / futures / object store. | Ray: verl, SkyRL, NeMo-RL, SLIME, MILES, ROLL, OAT, open-instruct. Monarch: TorchForge. | Richest abstraction; solves scheduling and fault tolerance out of the box. Adds a non-trivial runtime dependency and framework-specific debugging overhead. |
| Native Python Concurrency | Components are threads, coroutines (asyncio), multiprocessing child processes, and queues. No external orchestration runtime. | verifiers-rl, PipelineRL (intra-pool), ART (asyncio + child-process proxies), AReaL (asyncio-based event loop) | Minimal dependencies, easy to debug, full control. Limited to single-node unless paired with additional IPC (Redis, HTTP, NCCL) for multi-node communication. |
| Pub/Sub Message Bus | Components are decoupled producers and consumers communicating through append-only streams or message queues. Not orchestration per se, but a data transport layer between independently running pools. | PipelineRL (inter-pool: Redis XADD/XREAD streams for multi-node, append-only JSONL files for single-node) | Clean decoupling across pool boundaries without RPC. Doesn’t manage process lifecycle, scheduling, or fault recovery; must be paired with another orchestration type. |
| HTTP Microservices | Components are independent services communicating via REST APIs. Language-agnostic, maximum decoupling. | Atropos | Any inference server, any language, zero shared state. Highest latency (vs. NCCL); no shared object store; fault tolerance is the user’s responsibility. |

Note on Tunix: Tunix (Google) uses a JAX-native mesh model with ThreadPoolExecutor for async overlap and jax.device_put for cross-mesh weight transfer. It is architecturally distinct enough from the PyTorch ecosystem that direct comparison on orchestration is not meaningful; it lives in the XLA/TPU world with its own coordination primitives.

The table above reveals a striking pattern: eight of the sixteen libraries surveyed use Ray as their orchestration backbone. This is not a coincidence; it reflects a deep architectural fit between the actor model and the structure of RL training. A survey by Anyscale (the company behind Ray) of open-source RL libraries for LLMs confirms this convergence. RL training at large scale involves fundamentally heterogeneous components (inference engines, training engines, environments, reward models) that must be orchestrated across a cluster, often on different hardware types, with different scaling requirements and failure modes. Ray’s actor model maps directly onto this:

  1. Actor isolation and heterogeneous resources. Each RL component (vLLM inference server, FSDP trainer, reward model, environment pool) becomes a Ray actor with its own resource requirements (num_gpus, num_cpus, memory). Placement groups give fine-grained control over GPU affinity without manual SSH/torchrun orchestration.

  2. Scheduling and autoscaling. Ray’s scheduler handles the combinatorial problem of placing heterogeneous actors across a cluster. When generation requires 8× more GPU-hours than training, you can simply tell Ray to scale your inference actors independently.

  3. Fault tolerance. Long RL training runs (days to weeks) are vulnerable to GPU failures, OOM kills, and network partitions. Ray’s actor restart policies and object store replication provide resilience that would require significant custom infrastructure with raw asyncio and multiprocessing. open-instruct, for instance, relies on Ray’s actor supervision to recover from vLLM engine crashes mid-rollout.

  4. Object store for zero-copy data transfer. Rollout data can be large, tens of GB per batch for very long-context reasoning. Ray’s shared-memory object store enables zero-copy transfer between actors on the same node, avoiding the serialization overhead that typically comes with multiprocessing.Queue approaches.

  5. Ecosystem maturity. Ray has been battle-tested at scale since 2017, with production deployments on thousands of GPUs. The debugging overhead is real (Ray Dashboard, distributed stack traces, placement group failures), but the alternative, building equivalent coordination from scratch, is worse at multi-node scale. That said, Ray is a heavy dependency: it pulls in its own scheduler, object store, and dashboard, adding operational complexity that not every team needs. This is exactly why libraries like PRIME-RL, PipelineRL, and AReaL opted for lightweight native-Python coordination (asyncio, threading, Redis streams) instead: when you control the full stack and your deployment topology is fixed, the simplicity and debuggability of vanilla Python often outweigh the conveniences Ray provides.

The cost is a hard dependency on a non-trivial runtime. This trade-off can be worthwhile, especially for production-scale training (64+ GPUs, multi-day runs, complex reward computation).

While Ray’s actor model is the main player on the field, Monarch has emerged as a new PyTorch-native distributed actor framework from Meta, purpose-built for GPU workloads. Like Ray, Monarch is based on the actor model; components are independent actors with mailboxes communicating via messages, but it is designed from the ground up for the PyTorch/CUDA ecosystem rather than being a general-purpose distributed runtime.

Monarch offers several capabilities particularly relevant to async RL. An example implementation of async RL with Monarch (from the GPU Mode lecture series) demonstrates the architecture: generators, a replay buffer, and a trainer are modelled as Monarch actors, with the replay buffer absorbing latency variance from straggler rollouts and RDMA weight sync pushing updated parameters to generators without blocking training. The pattern is structurally equivalent to Ray-based designs (verl, SkyRL, open-instruct) but implemented with pure PyTorch-native primitives.



Axis 2: Rollout Buffer Design

How do generated rollouts flow from inference to training, and how deep is the pipeline?

The buffer is the data structure sitting between generation and training. Its depth controls the maximum degree of asynchrony, and therefore the maximum staleness.

| Pattern | Depth | Libraries | Characteristic |
|---|---|---|---|
| No buffer (synchronous) | 0 | TRL (current), ART (gather-all-then-train) | Generation and training alternate strictly; zero staleness, maximum idle time |
| Double-buffer (one-step-ahead) | 1 | verifiers-rl, SLIME (async mode), MILES, OAT | Submit generation N+1 at the start of training step N; overlap exactly one batch |
| Bounded async queue | 2–K | SkyRL, verl (fully async), NeMo-RL, ROLL, PRIME-RL, TorchForge, Tunix, open-instruct (async_steps), AReaL (max_head_offpolicyness) | Multiple batches in flight; staleness bounded by queue capacity |
| Unbounded / stream | Unlimited | PipelineRL (Redis streams), SLIME (fully async mode), Atropos | Continuous generation; staleness bounded only by explicit version control |

The double-buffer pattern is the simplest upgrade from synchronous to asynchronous training: it overlaps exactly one generation with one training step and introduces at most one step of policy lag!

Deeper queues, alternatively, improve throughput but require staleness management.

The buffer controls how much data is in flight. But data is only half the equation. The other half is getting updated weights back to the inference servers before those rollouts go stale. This is where weight sync comes in!



Axis 3: Weight Synchronisation Protocol

How do new model weights reach the inference servers after a gradient update?

Scope note: This axis focuses on disaggregated mode, where inference and training run on separate GPU pools, since that’s the deployment topology where async overlap (and therefore weight sync design) actually matters. Colocated setups (same GPUs for both roles) are inherently synchronous and don’t face the transport/interrupt trade-offs discussed below.

This is the most architecturally consequential axis. The protocol determines sync latency, interrupt granularity, and whether partial rollouts are possible.

There is a critical distinction to make here between the transport mechanism and the interrupt model. Most libraries pause generation at a coarse boundary, an HTTP request, a full batch, or even a full training step, before initiating weight transfer. PipelineRL is the outlier: it never stops generating at all.

Transport mechanism:

| Mechanism | Latency | Libraries |
|---|---|---|
| NCCL Broadcast | ~100–500ms | PipelineRL, SkyRL, SLIME, MILES, ROLL, OAT, NeMo-RL, PRIME-RL, open-instruct, AReaL |
| NCCL + Bucketing | ~20ms | verl |
| KV + Shared Memory | Low | TorchForge |
| Filesystem + HTTP | Medium | PRIME-RL, AReaL, ART |
| CUDA IPC (Zero-copy) | Very Low | NeMo-RL, MILES |
| JAX Cross-mesh | Low | Tunix |
| HTTP PUT | High | verifiers-rl |
| Filesystem + Restart | Very High | Atropos |

The interrupt model: when does generation pause to accept new weights?

This is where PipelineRL fundamentally diverges from every other library. Rather than listing each library individually, the landscape collapses into four conceptual tiers, ordered from finest to coarsest interrupt granularity:

| Interrupt Granularity | What Happens | Libraries |
|---|---|---|
| Never (In-flight per-forward-pass) | Sequences never stop. The weight swap happens between token decode steps (~1–10ms gap). Running sequences seamlessly continue with new weights. | PipelineRL, open-instruct (opt-in) |
| Per HTTP Request (Abort + Resync) | In-flight HTTP requests are aborted. Partial tokens are resubmitted with a prefix-resume mechanism or recycled for retry. | SkyRL, SLIME, MILES |
| Soft Pause (Drain in-flight) | No new generation requests are accepted while in-progress ones finish naturally. Once drained, weights are synced and generation resumes. | PRIME-RL, AReaL, open-instruct (default), verl (async) |
| Per Training Step / Batch (Blocking) | Generation must fully complete. The trainer and inference engine take turns blocking each other. | NeMo-RL, ROLL, OAT, TorchForge, Tunix, verifiers-rl, Atropos |

The “never-stop” tier is qualitatively different from all the others: PipelineRL, for instance, hooks into the inference engine so that a lock is acquired and released per transformer forward pass (one token step for one sequence). A weight update waits at most one forward pass (a few ms), swaps all parameters, and generation resumes immediately. Every other library stops generation at a coarser boundary, from one HTTP request (~hundreds of ms) up to a full batch boundary (~seconds).
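Conceptually, the in-flight mechanism reduces to a lock held only for the duration of one forward pass (a sketch of the idea, not PipelineRL’s actual code):

```python
import threading

swap_lock = threading.Lock()
weights_version = 0

def decode_one_token(sequence):
    with swap_lock:                          # the lock spans a single forward pass
        return sequence + [weights_version]  # token "generated" under current weights

def apply_weight_update():
    global weights_version
    with swap_lock:                          # waits at most one forward pass (~ms)
        weights_version += 1                 # swap all parameters atomically

seq = decode_one_token([])
apply_weight_update()                        # lands between two decode steps
seq = decode_one_token(seq)                  # the sequence continues under new weights
print(seq)  # → [0, 1]
```

The resulting trajectory mixes tokens from different policy versions, which is exactly why these systems record per-token logprobs at generation time.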

Weight sync controls when new weights arrive. But async training means rollouts are always being generated under some policy version, and that generating policy might be several gradient steps behind the trainer. How libraries handle this policy lag is staleness management.



Axis 4: Staleness Management

How does the system handle the fact that generated rollouts may come from an older policy than the one being trained?

Once generation and training overlap, samples become off-policy. Three orthogonal strategies have emerged for managing this staleness, and most production systems combine several:

Strategy 1: Per-sample version rejection. Every sample is tagged with the integer policy version that generated it. At training time, samples whose version falls behind the current policy by more than a threshold are hard-dropped before entering the loss computation. Simple and correct, but it wastes the compute spent generating the discarded samples.

Strategy 2: Depth bounding. The queue or buffer between generation and training has a bounded capacity (or an explicit staleness gate), which architecturally limits how far behind any sample can be. This ranges from depth=1 (one-step-ahead double buffering, where staleness is impossible by construction) to explicit capacity formulas tied to version gaps. No per-sample version tracking is required; the bound is enforced by the system’s pipeline depth.

Strategy 3: IS-weighted loss correction. Stale samples that reach the trainer are reweighted by the importance-sampling ratio $\frac{\pi_{\theta}(a \mid s)}{\pi_{\text{old}}(a \mid s)}$, typically clipped to bound gradient variance.

These strategies are orthogonal: a system can use version rejection alone, depth bounding alone, IS correction alone, or any combination of them. Synchronous systems avoid the problem entirely by never overlapping generation and training.
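A toy of how strategies 1 and 3 compose at training time (hypothetical field names; the cap of 5.0 mirrors AReaL’s decoupled-loss default listed in the table):

```python
import math

def filter_and_weight(samples, current_version, max_lag=4, is_cap=5.0):
    kept = []
    for s in samples:
        if current_version - s["version"] > max_lag:
            continue                                     # strategy 1: hard-drop stale samples
        ratio = math.exp(s["logp_new"] - s["logp_old"])  # pi_theta / pi_old for the action
        kept.append({**s, "is_weight": min(ratio, is_cap)})  # strategy 3: clipped IS weight
    return kept

samples = [
    {"version": 10, "logp_old": -2.0, "logp_new": -1.0},  # fresh; ratio e^1 ~ 2.72
    {"version": 9,  "logp_old": -3.0, "logp_new": -1.0},  # ratio e^2 ~ 7.39, clipped to 5.0
    {"version": 3,  "logp_old": -1.0, "logp_new": -1.0},  # lag 7 > 4: rejected outright
]
out = filter_and_weight(samples, current_version=10)
print([round(s["is_weight"], 2) for s in out])  # → [2.72, 5.0]
```

Strategy 2 would live outside this function entirely, in the bounded queue that limits how large the version gap can ever get.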

Library Version Rejection Depth Bounding IS Correction Key Config / Notes
AReaL ⚠️ max_head_offpolicyness capacity formula; optional use_decoupled_loss adds IS weight capped at 5.0
ART Synchronous; all rollouts collected before training; no staleness by design
Atropos max_batches_offpolicy, ceiling on buffered batches
MILES TIS + OPSM
NeMo-RL max_trajectory_age_steps, per-sample version drop
OAT Clipped TIS ratio
open-instruct ⚠️ async_steps cap (default 1, production 8); optional --truncated_importance_sampling_ratio_cap ρ adds clipped TIS
PipelineRL max_lag, integer version tag per sample; drop if age exceeds threshold
PRIME-RL Full hybrid: max_async_level version gap + max_off_policy_steps cancellation + IPO trust-region IS
ROLL Richest IS suite: TIS, TOPR, CISPO, Kimi15, six off-policy loss variants
SkyRL max_staleness_steps, capacity gate blocks new rollouts when exceeded
SLIME TIS + OPSM (off-policy masking for partial rollouts)
TorchForge max_policy_age, per-sample version tag; hard drop
Tunix Bounded queue + sync per step; staleness structurally limited
verl Clipped TIS ratio; optional OPSM
verifiers-rl Depth=1 FIFO + sync every step; staleness not possible by construction

✅ = yes, ❌ = no, ⚠️ = optional / configurable, — = not applicable (synchronous)

  • Version rejection is simple and correct, but wastes compute when many samples are discarded.
  • IS correction preserves throughput at the cost of gradient variance.
  • Depth bounding is the coarsest mechanism, but it avoids per-sample bookkeeping entirely.

The trend in production systems (PRIME-RL, AReaL, open-instruct) is toward hybrid approaches that combine depth bounding with optional IS correction, getting the architectural simplicity of bounded queues plus the loss-level safety net of importance weighting for stable training.

Staleness management handles data that was generated under an old policy. But what about data that is still being generated when a weight update lands?



Axis 5: Partial Rollout Handling

What happens to a generation in progress when a weight update arrives?

This is critical for long-context tasks where a single rollout can take minutes. Eight strategies have emerged:

| Strategy | Libraries | Description |
|---|---|---|
| Implicit continuation | PipelineRL | Sequences are never interrupted. Weights swap between forward passes; the sequence simply continues with new weights. Stored logprobs remain valid because training uses the recorded π_old, not recomputed values. |
| Abort + retry with prefix | SkyRL, SLIME | Active sequences are aborted. Partial tokens are gathered, then resubmitted with a prefix-resume mechanism using the new weights. |
| Explicit save/resume | verl (fully async) | The rollout worker saves partial token IDs and logprobs to a buffer, waits for sync, then resumes from the saved prefix. |
| Group cancellation (generation continues) | PRIME-RL | Stale rollout groups have their async tasks cancelled; the inference server continues serving in-flight HTTP requests whose results are discarded. Weight sync triggers between HTTP requests without interrupting mid-request generation. |
| No partial rollout support | verifiers-rl, OAT, Atropos, Tunix | Weight sync only happens at batch boundaries. In-flight generations must complete before sync begins. |
| Soft pause, in-flight sequences complete | AReaL | A pause signal blocks new KV-cache allocations but doesn’t abort in-progress sequences. The task dispatcher stops submitting new tasks; running tasks run to completion. After weight sync, generation dispatch resumes. |
| Full sleep, no in-flight at sync time | ART | By design, training only begins after all rollouts are collected. There are never in-progress sequences when sleep is triggered. Level-1 sleep (in-progress requests exist) offloads KV cache to CPU; level-2 sleep discards it entirely. |
| Drain-or-inflight (configurable) | open-instruct | Default: a stop flag gates new prefetching; the weight update waits for active tasks to drain. With in-flight updates enabled, the drain is bypassed and weights broadcast while tokens are still being generated; sequences in progress continue with a mixture of old and new weights. |

So far, every axis has assumed full-parameter training. But with LoRA, where you train only a few million adapter parameters instead of billions, the weight sync problem nearly disappears. Let’s look at how these libraries support LoRA training.



Axis 6: LoRA Training Support

Does the library support parameter-efficient training via LoRA adapters, in what modes, and does it exploit adapter-only weight sync?

LoRA is arguably the most practically consequential axis for teams with limited GPU budgets. It reduces the trainable parameter count by 99%+, halves peak activation memory, and, when the inference server is LoRA-aware, enables adapter-only weight sync: instead of broadcasting every parameter of a 7B+ model (~100–500ms over NCCL), only the adapter deltas are pushed to vLLM, which at rank 32 amounts to ~50 MB, a sub-millisecond transfer.
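The ~50 MB figure is easy to sanity-check. Assuming a hypothetical 7B-class config (28 layers, hidden size 3584, LoRA on the q/k/v/o projections, bf16) and treating all four projections as hidden×hidden, which slightly overestimates for GQA models:

```python
# Adapter size = layers x projections x (A: hidden x r + B: r x hidden) x bytes/param
def lora_adapter_mb(layers=28, hidden=3584, rank=32, projections=4, bytes_per_param=2):
    params_per_projection = 2 * hidden * rank     # A and B low-rank matrices
    total_params = layers * projections * params_per_projection
    return total_params * bytes_per_param / 1e6

print(f"{lora_adapter_mb():.0f} MB")  # → 51 MB
```

Roughly 25.7M trainable parameters in bf16, versus ~15 GB for the full model’s weights.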

| Library | LoRA Supported | Mode Restriction | LoRA Backend | Adapter-Only Sync |
|---|---|---|---|---|
| AReaL | ✅ Yes | FSDP2 only (not Megatron/Archon) | HF peft | ✅ Yes (disk-based sync; only trainable params transferred; vLLM adapter hot-swap) |
| ART | ✅ Yes (primary design) | Both (shared + dedicated GPU) | Unsloth/peft (default); custom Megatron LoRA | ✅ Yes (only adapter saved/loaded; in-process or HTTP adapter hot-swap; base weights never moved) |
| Atropos | ✅ Yes | Disaggregated | HF peft | ✅ Yes (lora_only / lora_restart modes) |
| MILES | ✅ Yes | Both (colocated + disaggregated) | Megatron-Bridge | ✅ Yes (adapter sync config for SGLang) |
| NeMo-RL | ✅ Partial* | Both | Custom (not peft) | ❌ No evidence |
| OAT | ✅ Yes | Both | HF peft | ✅ Yes (LoRA-only sync mode) |
| open-instruct | ⚠️ Code exists, not wired‡ | — | HF peft (SFT/DPO only) | ❌ No (LoRA not applied in the RL trainer) |
| PipelineRL | ✅ Yes | Non-colocated | HF peft | ❌ No (full NCCL broadcast) |
| PRIME-RL | ✅ Yes | Disaggregated | Custom MultiLoRA (not peft) | ✅ Yes (adapter-only state dict extraction) |
| ROLL | ✅ Partial† | DeepSpeed backend only | HF peft / TRL | ❌ No evidence |
| SkyRL | ✅ Yes | Both | peft (FSDP) / Megatron-Bridge (Megatron) | ✅ Yes (filesystem-based adapter sync) |
| SLIME | ❌ No | — | — | ❌ No |
| TorchForge | ❌ No | — | — | ❌ No |
| Tunix | ✅ Yes | Both | qwix (JAX-native) | ✅ Yes (auto-detected) |
| verl | ✅ Yes (most complete) | Both | peft (FSDP) / Megatron-Bridge (Megatron) | ✅ Yes (unmerged adapter sync) |
| verifiers-rl | ✅ Yes (via prime-rl) | Disaggregated | HF peft + FSDP2 + vLLM | ✅ Yes (vLLM LoRA serving) |

* NeMo-RL: LoRA for GRPO and DPO is supported only on the DTensor backend; the Megatron Core backend is SFT-only (RL LoRA listed as “coming soon”). Uses a custom DTensor-compatible LoRA module (not peft), optionally with Triton kernels.

† ROLL: LoRA is officially supported with the DeepSpeed training backend only. Megatron-backend LoRA appeared in the Feb 2026 changelog but remains experimental.

‡ open-instruct: The model config exposes LoRA-related fields (use_peft, lora_r, lora_alpha), and adapter saving is handled in the checkpoint logic. However, the peft model is never initialised in the RL training path; LoRA remains an SFT/DPO-only feature for the RL trainer as of March 2026.

Three LoRA implementation families:

  1. HuggingFace peft (PipelineRL, SkyRL/FSDP, verifiers-rl, ROLL, OAT, Atropos): The most common choice. Standard checkpoint format (adapter_model.safetensors), compatible with any HF Transformers training loop. ZeRO-3 interactions require care: OAT, for instance, must disable the fused LM head; ROLL must disable gradient checkpointing entirely.

  2. Megatron-Bridge (verl/Megatron, SkyRL/Megatron, MILES): Required for 3D-parallel training (TP × PP × DP). Supports multiple LoRA types: lora, canonical_lora (splits merged QKV → separate Q/K/V adapters), vlm_lora, and dora. The canonical_lora variant avoids the QKV merge, thereby improving training stability. MILES saves checkpoints in both HF peft format and Megatron-native per-rank format.

  3. Custom implementations (NeMo-RL, PRIME-RL, Tunix/qwix): Library-specific LoRA modules not interoperable with peft checkpoints. PRIME-RL uniquely supports multiple simultaneous adapters in a single run for multi-experiment parallelism. Tunix uses Google’s qwix JAX library, which adds built-in QLoRA (NF4 quantization) and TPU-native gradient routing. NeMo-RL uses a custom DTensor-compatible module with an optional Triton fused kernel.

The adapter-only weight sync opportunity (interaction with Axis 3):

Eight of the thirteen libraries support pushing only the LoRA adapter deltas to the inference server. This changes the nature of the weight sync problem (Axis 3) entirely. With full-parameter training, the interrupt model (per-forward-pass lock vs. per-request abort vs. per-batch pause) determines how much generation is wasted during an NCCL broadcast. With LoRA and adapter-only sync, the transfer is so small that nearly any interrupt model delivers identical throughput! Even Atropos’s brute-force HTTP hot-swap becomes viable.
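The trainer-side half of adapter-only sync is conceptually just a state-dict filter. A minimal sketch, assuming peft-style key naming (adapter tensors carry `lora_A` / `lora_B` in their keys); the tensor values here are placeholder strings, not real weights:

```python
# Sketch: adapter-only weight sync extracts just the LoRA deltas from the
# trainer's state dict before pushing them to the inference server.
# Assumes peft-style naming ("lora_A" / "lora_B" substrings in keys).
def extract_adapter_state(state_dict):
    return {k: v for k, v in state_dict.items() if "lora_" in k}

full = {
    "model.layers.0.self_attn.q_proj.weight": "base-7B-tensor",
    "model.layers.0.self_attn.q_proj.lora_A.weight": "r x d tensor",
    "model.layers.0.self_attn.q_proj.lora_B.weight": "d x r tensor",
}
adapter = extract_adapter_state(full)
print(sorted(adapter))  # only the two lora_* keys survive; base weights never move
```

The inference-side half (hot-swapping the adapter into vLLM or SGLang) is where the libraries above differ most.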




Axis 7: Distributed Training Backend & Parallelism

What parallelism strategy does the library use for training, and how does this constrain or enable the async architecture?

This axis cuts across every other axis. The choice of training backend determines how large a model can fit per GPU, how many collective operations are needed to gather weights before broadcasting to inference servers, and which model architectures can be trained at all. It is the single most consequential decision for teams scaling beyond 30B parameters or moving from dense to Mixture-of-Experts models.

| Library | Training Backend | Parallelism | HF Model Loading |
|---|---|---|---|
| AReaL | FSDP2, Megatron, Archon | DP, SP, TP, PP, CP, EP | ✅ Direct / Convert |
| ART | Unsloth, Megatron | DP, TP, EP | ✅ Direct / Convert |
| Atropos | PyTorch Native, TRL | DP | ✅ Direct |
| MILES | Megatron, FSDP2 | DP, TP, PP | 🔄 Convert |
| NeMo-RL | FSDP2, Megatron | DP, SP, TP, PP, CP, EP | ✅ Direct / Convert |
| OAT | DeepSpeed | DP, TP | ✅ Direct |
| open-instruct | DeepSpeed | DP, SP | ✅ Direct |
| PipelineRL | DeepSpeed | DP, SP | ✅ Direct |
| PRIME-RL | FSDP2 | DP, TP, CP, EP | ✅ Direct |
| ROLL | DeepSpeed, Megatron, FSDP2 | DP, SP, TP, PP, CP, EP | ✅ Direct / Convert |
| SkyRL | FSDP, Megatron | DP, SP, TP, PP, EP | ✅ Direct / Convert |
| SLIME | Megatron | DP, TP, PP, SP | 🔄 Convert |
| TorchForge | FSDP2 | DP, TP, CP | ✅ via TorchTitan |
| Tunix | JAX/XLA | DP, TP | ❌ Custom Flax |
| verl | FSDP, Megatron | DP, SP, TP, PP, CP, EP | ✅ Direct / Convert |
| verifiers-rl | DeepSpeed | DP | ✅ Direct |

The training backend has direct implications for async RL library design:

Weight sync speed is a direct function of the training backend, and faster sync means less staleness.

In a disaggregated async setup, weight sync does not necessarily stall inference. The key design decision is how the weight update interacts with in-flight generation; four strategies exist, ordered from least to most disruptive:

  • Atomic swap, no interruption. The full weight update is dispatched as a single blocking RPC to the inference engine. Each forward pass sees either all old or all new weights, never a mixture. Generation pauses for at most one forward-pass gap (~a few ms). (PipelineRL)
  • Per-parameter streaming, no interruption. Each parameter is sent as a separate RPC + NCCL broadcast. Forward passes interleave between individual parameter updates, so in-flight sequences genuinely see a mixture of old and new weights across layers. Maximum overlap, but weakest consistency. (open-instruct, in-flight mode)
  • Dispatch gate, drain in-flight, then sync. New requests are held back while in-progress sequences complete naturally; weights are broadcast only after the pipeline drains. No wasted tokens, but a sync bubble proportional to the longest in-flight sequence. (PRIME-RL, AReaL, open-instruct default, verl fully-async)
  • Hard pause or abort. Inference is paused, or in-flight requests are aborted before weight transfer begins. Cleanest consistency, highest wasted compute. (verl, SkyRL)
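The dispatch-gate strategy is easy to see in miniature. This is a toy asyncio sketch with hypothetical names (not any library’s actual API): a pause event gates new rollout dispatch, the sync waits for in-flight tasks to drain, and only then does the broadcast run.

```python
import asyncio

# Toy sketch of "dispatch gate, drain in-flight, then sync".
# Names are illustrative; no real library exposes exactly this API.
class DrainGate:
    def __init__(self):
        self.paused = asyncio.Event()  # set => new dispatch is gated
        self.active = 0
        self.idle = asyncio.Event()
        self.idle.set()

    async def rollout(self, coro):
        while self.paused.is_set():          # hold back new work during sync
            await asyncio.sleep(0.001)
        self.active += 1
        self.idle.clear()
        try:
            return await coro
        finally:
            self.active -= 1
            if self.active == 0:
                self.idle.set()

    async def sync_weights(self, broadcast):
        self.paused.set()        # gate new dispatch
        await self.idle.wait()   # drain: in-flight sequences finish naturally
        await broadcast()        # broadcast happens on an empty pipeline
        self.paused.clear()      # resume dispatch

async def demo():
    gate = DrainGate()
    order = []
    async def gen(i):
        await asyncio.sleep(0.01)            # stand-in for generation
        order.append(f"rollout-{i}")
    async def broadcast():
        order.append("sync")                 # stand-in for the NCCL broadcast
    tasks = [asyncio.create_task(gate.rollout(gen(i))) for i in range(3)]
    await asyncio.sleep(0)                   # let rollouts start
    await gate.sync_weights(broadcast)       # drains all three, then syncs
    await asyncio.gather(*tasks)
    return order

print(asyncio.run(demo()))  # "sync" appears only after all rollouts complete
```

The sync bubble is visible in the sketch: `sync_weights` blocks for as long as the slowest in-flight rollout, which is exactly the cost the text describes.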

But even in libraries where inference continues, slower sync means longer periods where inference runs on stale weights. The policy version gap between the trainer and the inference pool grows with sync duration. Something to bear in mind.

**MoE support is an increasingly important differentiator as the field moves toward sparse models.**
The trend is clear: frontier models are sparse (DeepSeek-V3, Qwen3-MoE, Mixtral, DBRX), and open-weight MoEs are becoming the default starting point for post-training. Training these models requires Expert Parallelism (EP), distributing different experts to different ranks, which most async RL libraries don’t support. Only Megatron-backed libraries (verl, SLIME, MILES, ROLL, NeMo-RL) and PRIME-RL’s FSDP2+EP path handle EP correctly. ZeRO-based libraries (PipelineRL, verifiers-rl, OAT, open-instruct) can load MoE HuggingFace model classes, but without EP each expert is sharded across all ZeRO-3 ranks rather than being placed on a dedicated rank; every forward pass AllGathers every expert, negating the sparsity advantage entirely. EP also complicates weight sync: before broadcasting to vLLM/SGLang (which typically serves all experts from a single TP group), the trainer must AllGather expert parameters from every EP rank, an O(N_experts × E_size) communication step that grows with the expert count.
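To get a feel for the scale of that expert gather, here is a back-of-envelope estimate. The numbers are assumptions in the spirit of a large open MoE (128 experts, ~40M bf16 parameters per expert FFN, an optimistic 200 GB/s effective interconnect rate), not measurements of any library:

```python
# Illustrative cost of gathering expert weights before broadcasting to the
# inference engine. All numbers are assumptions, not measurements.
n_experts = 128
expert_params = 40e6                          # per-expert FFN parameters (assumed)
bytes_total = n_experts * expert_params * 2   # bf16 => 2 bytes per param
gb = bytes_total / 1e9
seconds_at_200_gbps = bytes_total / 200e9     # optimistic NVLink-class rate
print(f"{gb:.1f} GB to AllGather, ~{seconds_at_200_gbps * 1000:.0f} ms at 200 GB/s")
# prints "10.2 GB to AllGather, ~51 ms at 200 GB/s"
```

Tens of milliseconds per sync just for the expert gather, before the broadcast itself, is why EP-aware sync paths are becoming a differentiator.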

**MoE LoRA is an emerging requirement, and a difficult one.**
LoRA on dense models is well understood (Axis 6): attach adapters to attention projections, train them, sync only the adapter deltas. MoE LoRA is harder because the natural target is the expert FFN layers, meaning each expert gets its own adapter. For a model with 64 experts and rank-32 LoRA on each expert’s gate/up/down projections, the adapter count jumps from ~20 (dense) to ~200+ (MoE), and the adapters are distributed across EP ranks. Weight sync must gather adapters from every EP rank before pushing them to the inference server, a coordination problem that doesn’t exist for dense LoRA. Among the surveyed libraries, only ART explicitly implements MoE expert LoRA layers (Megatron EP path with per-expert LoRA and manual allreduce), and MILES supports LoRA via Megatron-Bridge, which can target expert layers. verl’s Megatron-Bridge path supports LoRA types including vlm_lora, but MoE-specific expert LoRA is not documented. vLLM’s LoRA serving doesn’t natively support per-expert adapters; it loads a single adapter applied uniformly, so adapter-only sync for MoE LoRA currently requires custom inference-side logic. As MoE models become the default for post-training, MoE LoRA with efficient adapter-only sync will be a key capability gap to close.

That covers the seven axes, each capturing a different facet of the same underlying problem. Together they give us a complete lens for comparing libraries. Time to put it all on one page.




4. Global Overview: Sixteen Libraries at a Glance

Note: This overview reflects the state of these libraries as of March 2026. The ecosystem is evolving rapidly; specific features, backends, and integrations may change in the near future.

| Library | Org | Orchestration Type | Inference Server | Weight Sync | Staleness Management | Partial Rollout | Training Backend | Dist. Parallelism | LoRA Support |
|---|---|---|---|---|---|---|---|---|---|
| AReaL | inclusionAI | Native Python (asyncio + HTTP RPC); pluggable Ray/Slurm | vLLM, SGLang | NCCL chunked OR filesystem safetensors | Depth + IS (optional) | 🟧 Soft pause (in-flight complete) | FSDP2 or Megatron-LM or Archon | FSDP2: DP+SP+TP; Megatron: TP+SP+PP+CP+EP; Archon: FSDP2+TP+SP+PP+EP | peft (Adapter-only) |
| ART | OpenPipe | Native Python (asyncio + mp child processes) | vLLM | LoRA adapter swap (no full weight transfer) | Synchronous (none) | ❌ No | Unsloth (single-GPU); Megatron-LM | None (Unsloth); TP×EP×DP (Megatron) | peft / Megatron LoRA (Adapter-only) |
| Atropos | NousResearch | HTTP Microservices (FastAPI) | vLLM, SGLang, OpenAI | FS checkpoint + vLLM restart | Depth bounding | ❌ No | Single-GPU PyTorch; TRL/Accelerate | None (native); FSDP/ZeRO via TRL adapter | peft (Adapter-only) |
| MILES | radixark | Distributed Actor (Ray) | SGLang | NCCL OR CUDA IPC | IS correction | 🟧 Abort + recycle to buffer | Megatron-LM (primary); FSDP2 | Megatron: TP×PP×DP; FSDP2 available; colocated CUDA IPC | ✅ Megatron-Bridge (Adapter-only) |
| NeMo-RL | NVIDIA | Distributed Actor (Ray) | vLLM, SGLang, Megatron | NCCL OR CUDA IPC-ZMQ OR HTTP | Version rejection | ✅ In-flight continuation | DTensor (FSDP2+TP) or Megatron-Bridge | DTensor: TP+SP+CP+FSDP2; Megatron: TP×PP×CP×EP×ETP + FSDP2 | 🟧 Custom (No adapter-only sync) |
| OAT | SAIL-SG | Distributed Actor (Ray) | vLLM | NCCL per-param + ZeRO-3 gather | IS correction | ❌ No | DeepSpeed ZeRO-2/3 | ZeRO-2 / ZeRO-3 DP; AutoTP | peft (Adapter-only) |
| open-instruct | AI2 (AllenAI) | Distributed Actor (Ray) | vLLM | NCCL broadcast; optional in-flight updates | Depth + IS (optional) | 🟧 Drain-or-inflight (configurable) | DeepSpeed ZeRO-0/2/3 | ZeRO-3 DP + Ulysses SP; vLLM TP (inference only) | ❌ No |
| PipelineRL | ServiceNow | Native Python + Pub/Sub (asyncio + Redis/JSONL) | vLLM | NCCL pg + HTTP notify | Version rejection | ✅ Implicit continuation | DeepSpeed ZeRO-3 | ZeRO-3 DP + Ring SP; ZeRO++ available | peft (Full sync) |
| PRIME-RL | PrimeIntellect | Native Python (asyncio + FS/ZMQ) | vLLM | Filesystem safetensors + HTTP OR NCCL | Version + depth + IS | 🟧 Group cancellation | FSDP2 (exclusively) | FSDP2 per-block + TP + CP + EP; pp=1 | ✅ Custom MultiLoRA (Adapter-only) |
| ROLL | Alibaba | Distributed Actor (Ray) | vLLM, SGLang | NCCL via dedicated update group | IS correction | ❌ No | DeepSpeed ZeRO or Megatron or FSDP2 | DS: ZeRO+Ulysses SP; Megatron: TP×PP×CP×EP; FSDP2: HSDP+Ulysses | 🟧 peft (DeepSpeed only) |
| SkyRL | NovaSky-AI | Distributed Actor (Ray) + Native Python | vLLM, SGLang | NCCL pg | Depth bounding | 🟧 Abort + retry with prefix | FSDP/FSDP2 or Megatron-Bridge | FSDP: ZeRO shard + Ulysses SP; Megatron: full 5D via bridge; JAX backend | peft / Megatron-Bridge (Adapter-only) |
| SLIME | THUDM | Distributed Actor (Ray) | SGLang | NCCL pg, bucketed | IS correction | 🟧 Abort + recycle to buffer | Megatron-LM | TP×PP×DP; Megatron→HF conversion; MoE EP all-gather | ❌ No |
| TorchForge | Meta | Distributed Actor (Monarch) | vLLM | torchstore + shared memory prefetch | Version rejection | ❌ No | FSDP2 via TorchTitan | FSDP2 + TP; CP partial; PP not yet implemented | ❌ No |
| Tunix | Google | Native Python (ThreadPoolExecutor + asyncio); JAX-native | vLLM, SGLang, JAX | Cross-mesh reshard | Depth bounding | ❌ No | JAX/XLA 2D mesh | 2D JAX mesh: FSDP + TP; no PP; TPU-primary | ✅ qwix / QLoRA (Adapter-only) |
| verl | ByteDance | Distributed Actor (Ray) | vLLM, SGLang | NCCL + checkpoint-engine buckets | IS correction | ✅ Explicit save/resume | FSDP1/FSDP2 or Megatron-Core | FSDP: ZeRO-2/3/HSDP + Ulysses SP; Megatron: TP×PP×VPP×CP×EP×ETP | peft / Megatron-Bridge (Adapter-only) |
| verifiers-rl | PrimeIntellect | Native Python (threading + asyncio) | vLLM | PyNCCL broadcast | Depth bounding (depth=1) | ❌ No | DeepSpeed ZeRO-3 (Accelerate) | ZeRO-3 DP only; no TP/PP | peft (Adapter-only) |

That’s the current state of play. But the field is moving fast, and several emerging trends are about to stress-test these architectures in ways their designers may not have anticipated.




5. The Next Wave: Design Implications

The trends below are not a list of new techniques; each creates concrete pressure on the infrastructure and algorithmic decisions made today. The question is not “what’s the frontier?” but “if this trend wins, what breaks in my current stack?”



5.1 Critic-Free Algorithms: Memory Freed, But Weight Sync Pressure Increases

PPO’s value network doubles the memory footprint of any training node. The field is converging on critic-free variants (GRPO, REINFORCE++, Online DPO) precisely because long CoT reasoning makes this overhead prohibitive at 8K–64K context lengths.

What this unlocks: Eliminating the critic frees ~50% of training GPU memory. This slack can be reallocated to: (a) larger rollout batches, directly reducing the straggler variance problem, or (b) co-locating inference and training on the same GPUs, which eliminates the need for a separate NCCL weight sync process group entirely.

What it doesn’t solve: Critic-free methods still require frequent weight pushes to inference servers. In fact, they can increase sync pressure: with no value network to provide a stable baseline, GRPO-style algorithms require larger group sizes (G=8–32) to get low-variance advantage estimates, which means more rollouts per step and faster policy drift. Libraries that sync only at coarse boundaries (per training step or per K steps) will see staleness grow faster under critic-free training.
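The group-size pressure is ordinary standard-error arithmetic. Taking the group mean reward as the baseline (with a per-rollout reward standard deviation assumed to be 1 for illustration), the baseline noise shrinks only as 1/√G:

```python
import math

# Why critic-free methods want large groups: the group-mean baseline's
# noise falls only as 1/sqrt(G). sigma = 1.0 is an illustrative assumption.
sigma = 1.0
for G in (2, 8, 32):
    print(G, round(sigma / math.sqrt(G), 3))
# prints:
# 2 0.707
# 8 0.354
# 32 0.177
```

Quadrupling G only halves the baseline noise, so variance reduction is bought with rollouts, and every extra rollout widens the policy-version gap.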

Asymmetric trajectory filtering (GRPO-RoC: oversample rollouts, strictly filter positives, uniformly downsample negatives; CISPO/DAPO-style asymmetric clipping in DeepSeek-V3.2 and MiniMax-M1) has a subtler impact on staleness. The problem is not the batch shrinking per se; it’s the composition of the surviving batch. Positive trajectories (correct solutions to easy prompts) converge faster and are retained preferentially; harder prompts yield mostly negative trajectories that are discarded. The result: the samples that survive filtering are systematically older than the average rollout in the buffer, because the easy prompts they solve were issued earlier in training. A buffer full of nominally “fresh” rollouts can contain surviving positives spanning a wide range of policy versions. Admission control that tracks staleness at the batch level (e.g., SkyRL’s max_staleness_steps gate, Atropos’s max_batches_offpolicy) cannot detect this intra-batch version spread. Per-sample version tagging (Axis 4) is not optional in this regime; the trainer must be able to reject or IS-correct individual samples whose policy version diverges too far, even if the batch they belong to was admitted recently.
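Per-sample version gating is simple to sketch. This is an illustrative shape (the field names and thresholds are our assumptions, not any library’s API): samples within a tight window pass through unweighted, moderately stale ones get a truncated importance-sampling weight from their stored logprobs, and anything older is rejected.

```python
import math

# Sketch of per-sample staleness handling; names and thresholds are
# illustrative, not a specific library's API.
def admit(samples, current_version, max_lag=4, is_lag=2):
    kept = []
    for s in samples:
        lag = current_version - s["policy_version"]
        if lag > max_lag:
            continue                                  # reject: too stale
        if lag > is_lag:
            # truncated IS ratio pi_new(a|s) / pi_old(a|s), capped at 1
            ratio = math.exp(s["logprob_new"] - s["logprob_old"])
            s = {**s, "is_weight": min(ratio, 1.0)}
        else:
            s = {**s, "is_weight": 1.0}
        kept.append(s)
    return kept

batch = [
    {"policy_version": 10, "logprob_new": -1.0, "logprob_old": -1.0},  # fresh
    {"policy_version": 7,  "logprob_new": -1.5, "logprob_old": -1.0},  # IS-corrected
    {"policy_version": 3,  "logprob_new": -1.0, "logprob_old": -1.0},  # rejected
]
out = admit(batch, current_version=10)
print([(s["policy_version"], round(s["is_weight"], 3)) for s in out])
# prints [(10, 1.0), (7, 0.607)]
```

Note that the gate operates on each sample independently, which is exactly what batch-level admission control cannot do.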

Critic-free methods simplify the training side. But the scoring side is about to get costlier: process reward models score intermediate reasoning steps, not just final answers, and that introduces a whole new synchronisation bottleneck.



5.2 Process Rewards: A New Synchronisation Barrier

Outcome reward is scalar and cheap: one call to a verifier at the end of a rollout. Process reward models (PRMs) score intermediate steps, which requires either (a) a separate PRM forward pass over the full reasoning trace, or (b) an online utility function computed token-by-token during generation.

PRPO (entropy-spike segmentation with PRM scoring per segment) and DEEP-GRPO (pivot identification via online utility) both incur computational overhead between generation and training. In the current library landscape, this phase maps awkwardly onto the preprocessor pool (PipelineRL) or requires an additional Ray actor (verl, NeMo-RL). Neither is designed for it.

The key implication: PRM-based credit assignment breaks the assumption that rewards are cheap to compute. A PRM forward pass over a 32K-token reasoning trace from a 7B model can be very costly. At G=8 completions per prompt, the reward computation can eat non-negligible wall time relative to the generation itself. Two consequences:

  1. Async reward pipelines become crucial. PRIME-RL runs reward scoring concurrently with training as part of its fully async Orchestrator-Trainer pipeline; the Orchestrator handles scoring while the Trainer performs backward and optimizer steps independently. For PRM-based methods, this pipelined reward computation is not optional; synchronous reward scoring would dominate training wall time.
  2. The separate preprocessor pool becomes crucial. Running reference logprob computation and PRM scoring on a dedicated GPU tier, pipelined between generation and training, is the right architecture for dense credit assignment.
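A rough FLOPs comparison makes the cost concrete. Model sizes and trace lengths below are illustrative assumptions; the point is only the order of magnitude: one PRM forward pass over every generated token is the same order of compute as generating the trace in the first place.

```python
# Rough FLOPs comparison (assumptions, not measurements): scoring a full
# reasoning trace with a PRM costs one forward pass over every token.
P_policy = 7e9      # policy params (assumed)
P_prm = 7e9         # PRM params (assumed same model class)
T = 32_768          # trace length in tokens
G = 8               # completions per prompt

gen_flops = 2 * P_policy * T * G    # ~2*P FLOPs per generated token
prm_flops = 2 * P_prm * T * G       # one scoring forward per token
print(f"PRM scoring / generation FLOPs: {prm_flops / gen_flops:.1f}x")
# prints "PRM scoring / generation FLOPs: 1.0x"
```

In practice generation is memory-bandwidth-bound while scoring can batch efficiently, so wall-clock ratios differ, but scoring clearly stops being a rounding error.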

DEEP-GRPO’s pivot resampling introduces a third generation pattern alongside standard rollouts and partial rollout resumes: local resampling from a mid-sequence state. This requires saving KV cache state at pivot points, which no current async library supports out of the box. Weight sync at pivot boundaries could become a new correctness requirement: if weights change between the pivot generation and the local resample, the advantage estimate is corrupted. We can, of course, recompute the KV cache in a single prefill, but that wastes precious compute.



5.3 Multi-Agent Co-Evolution: The Straggler Problem Compounds

Single-agent GRPO trains one policy generating G completions per prompt. Emerging multi-agent self-play means the effective “group” spans multiple model invocations chained sequentially. The reward is only available after all models in the chain have completed.

Straggler dynamics change qualitatively. In single-agent GRPO, the straggler is the longest completion in a group, a tail event in a unimodal length distribution. In multi-agent pipelines, the straggler compounds across two or more length distributions. In a Proposer/Solver architecture where each stage’s 90th-percentile completion time is 5× its median, a rollout in which both stages straggle runs at roughly 25× the median.

RL on swarms of agents implies a new unit of work. Today, the atomic unit in every library is a single (prompt, completion, reward) triple. In multi-agent training, the atomic unit becomes an episode: a directed graph of turns, tool calls, and inter-agent messages. Buffer design, staleness tracking, and advantage computation all must operate over episodes. Replaying or forking episodes is an open design question as well.

Straggler problems across agents are bad enough when the model is at least internally consistent. With MoE architectures, even a single model can disagree with itself across inference and training frameworks, and this raises a new set of emerging problems in RL training.



5.4 Training-Inference Mismatch: The DeepSeek-V3.2 MoE Case Study

The training-inference mismatch problem is endemic in async RL: anytime rollout data is generated under a policy π_old by one framework and gradients are computed under π_θ by another, numerical discrepancies between the two stacks bias the update. Two sources of mismatch are particularly acute for MoE models.

Source 1: MoE expert routing inconsistency. Mixture-of-Experts models activate a sparse subset of experts per token. Inference frameworks (vLLM, SGLang) and training frameworks (Megatron, FSDP) implement the router independently, and differences in floating-point rounding in the gating function can lead to different expert selections for identical inputs. When expert routing diverges, the active parameter subspace shifts discontinuously; a gradient step computed assuming Expert A was active is applied to weights that are active under Expert B. DeepSeek-V3.2 found this “induces abrupt shifts in the active parameter subspace, which destabilizes optimization and exacerbates off-policy issues.”

Their solution, Keep Routing, preserves the exact expert routing paths used during sampling (inference) and enforces those paths during the training forward pass. This requires the inference framework to record and return routing decisions alongside token logprobs, and the training framework to accept and enforce them. No current open-source async RL library implements this. For any team training MoE models (DeepSeek-V3 class, Mixtral, future open MoEs), this is a correctness issue, not a performance issue.

Source 2: Sampling truncation mask mismatch. Top-p and top-k sampling truncate the vocabulary at generation time, excluding low-probability tokens from the sampling distribution. During training, the full vocabulary is visible to π_θ, so the trainer’s logprobs are computed over a distribution that includes tokens the sampler could never have produced.

DeepSeek-V3.2’s Keep Sampling Mask solution: record the truncation mask during sampling and apply it to π_θ during the training forward pass, so both policies operate over the same vocabulary subset. This requires passing the mask back from the inference server to the trainer, which is again something no current library infrastructure supports.
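The trainer-side correction amounts to computing a log-softmax restricted to the recorded mask. A minimal sketch (pure Python on a toy 4-token vocabulary; the function name and shapes are illustrative, not DeepSeek’s implementation):

```python
import math

# Sketch of a sampling-mask-consistent logprob: the inference server records
# which token ids survived top-p truncation, and the trainer renormalizes
# its own logits over that same subset so both policies share a support.
def masked_logprob(logits, sampled_id, mask_ids):
    # log-softmax restricted to the recorded truncation mask (stable form)
    z = max(logits[i] for i in mask_ids)
    log_norm = z + math.log(sum(math.exp(logits[i] - z) for i in mask_ids))
    return logits[sampled_id] - log_norm

logits = [2.0, 1.5, 0.2, -3.0]   # trainer-side logits, toy vocab of 4
mask = [0, 1]                     # ids kept by top-p at sampling time
print(round(masked_logprob(logits, sampled_id=0, mask_ids=mask), 4))
# prints -0.4741; the full-vocab logprob would be lower, because mass on
# ids 2 and 3 never existed for the sampler
```

Without the mask, the trainer systematically underestimates the sampling probability of every kept token, a bias that compounds over long sequences.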

Implications for library design: Both Keep Routing and Keep Sampling Mask require the inference server to return additional metadata alongside token logprobs: routing decisions and sampling masks. The current API contract between inference servers (vLLM, SGLang) and trainers is (token_ids, logprobs, finish_reason). Extending this to (token_ids, logprobs, finish_reason, expert_routing, sampling_mask) is a breaking change to every library’s data flow.



5.5 Distillation: The Same Async Problem Under a Different Name

On-policy distillation, where a student model generates sequences and a teacher model scores them with token-level logprobs, is structurally the same as the async coordination problem in GRPO.

Every design axis in this survey (rollout buffers, weight sync protocols, staleness management, partial rollout handling) applies identically to distillation. The generation pool produces student rollouts, the teacher scores them (replacing the verifier), and the trainer computes a backward pass with either an advantage-modified GRPO loss or a standalone KL objective. Self-distillation adds one more coordination requirement: the teacher is a frozen snapshot of the student from step N−k, so the system must periodically checkpoint the policy and hot-swap the teacher server without disrupting the pipeline, a primitive that no library has fully automated.

The practical implication for library design is that async RL infrastructure should not be built as a GRPO-specific system. The generation–scoring–training pipeline is a general pattern that covers RL with outcome rewards, RL with process rewards, on-policy distillation, and self-distillation. Libraries like SLIME, MILES, PRIME-RL, AReaL, and NeMo-RL already support both GRPO and on-policy distillation precisely because their async scaffolding treats the reward/scoring phase as a pluggable component rather than a hardcoded verifier call. Any async trainer that aspires to generality should do the same: define the scoring phase as an interface (an HTTP endpoint, a Ray actor, or a co-located forward pass), and let the buffer, staleness, and weight-sync machinery operate identically regardless of what fills it.




6. Design Decisions for TRL’s Async Trainer

Having surveyed the full landscape (orchestration models, buffer designs, weight sync protocols, staleness strategies, and partial rollout handling), we can now lay out concrete design decisions for an async trainer in TRL, along with the future-proof directions we intend to explore.



Design Principle: Keep Orchestration Lightweight

One of the strengths of the current TRL implementation is that it doesn’t depend on a heavy orchestrator system to manage the training lifecycle. Data inside the library stays native Python objects without external-library coloring. We want to preserve this: orchestration should stay as simple as possible, with no dependency on heavyweight external frameworks.



1. Bounded Queue with Per-Token model_version (No Double-Buffering)

Rather than starting with double-buffering and graduating to something more granular, we go straight to a bounded queue where every token is tagged with the model_version that produced it. This is the finest possible granularity from the start; it enables importance-sampling correction at the token level, supports simple admission gating (drop or down-weight tokens beyond a staleness threshold), and avoids the architectural debt of retrofitting token-level provenance onto a batch-level buffer later.
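A minimal sketch of this shape, using hypothetical names rather than the final TRL API: the queue is bounded, every token carries the version of the policy that produced it, and the consumer applies a per-token staleness gate. This matters for partial rollouts, where a single sequence can straddle a weight update.

```python
from collections import deque

# Sketch of a bounded rollout queue with per-token model_version tagging.
# Hypothetical shape; not the final TRL API.
class RolloutQueue:
    def __init__(self, maxlen=1024, max_lag=4):
        self.q = deque(maxlen=maxlen)   # bounded: oldest rollouts fall off
        self.max_lag = max_lag

    def put(self, token_ids, model_versions):
        assert len(token_ids) == len(model_versions)
        self.q.append(list(zip(token_ids, model_versions)))

    def get(self, current_version):
        rollout = self.q.popleft()
        # admission gate at token granularity: flag tokens whose producing
        # version lags too far behind the trainer
        return [(tok, ver, current_version - ver <= self.max_lag)
                for tok, ver in rollout]

q = RolloutQueue(max_lag=2)
# a partial rollout that straddled a weight update: first tokens v7, rest v8
q.put([11, 12, 13], [7, 7, 8])
print(q.get(current_version=10))
# prints [(11, 7, False), (12, 7, False), (13, 8, True)]
```

The boolean flag here could just as easily be an importance-sampling weight; the point is that provenance survives down to the token.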



2. NCCL Weight Sync with Packed Transfers

NCCL process groups are a necessity, and we already use them. Adding bucketing should be the next step, as vLLM’s NCCLWeightTransferEngine with packed=True directly supports bucketed broadcast: it packs parameters into configurable-size uint8 buffers (default 1 GB, double-buffered across CUDA streams) and broadcasts them via a dedicated NCCL communicator separate from the training process group. This eliminates the per-parameter call overhead that dominates naive broadcast, yielding a large sync speedup.
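The bucketing idea itself is simple to sketch. This toy version greedily groups parameter names into fixed-size buckets (the real engine packs GPU tensors into uint8 buffers and overlaps buckets across CUDA streams; sizes here are illustrative):

```python
# Toy sketch of the bucketing behind packed transfers: greedily group
# parameters into fixed-size buckets so one broadcast moves many tensors
# instead of issuing one NCCL call per parameter. Sizes are illustrative.
def pack_into_buckets(param_sizes_bytes, bucket_bytes=1_000_000_000):
    buckets, current, used = [], [], 0
    for name, size in param_sizes_bytes:
        if used + size > bucket_bytes and current:
            buckets.append(current)     # flush the full bucket
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets

# ~14 GB of bf16 params in 50 MB chunks -> 14 one-GB buckets,
# i.e. 14 broadcast calls instead of 280.
params = [(f"layer{i}.w{j}", 50_000_000) for i in range(28) for j in range(10)]
buckets = pack_into_buckets(params)
print(len(params), "params ->", len(buckets), "buckets")
# prints "280 params -> 14 buckets"
```

Collapsing hundreds of per-parameter broadcasts into a handful of large ones is where the speedup comes from: NCCL launch overhead is paid per call, and large messages saturate the link.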

Beyond vLLM’s built-in engine, we will explore high-performance weight transfer libraries for more demanding scenarios:

  • Awex (inclusionAI): a dedicated weight synchronization framework for RL training that handles the hard problem of cross-engine transfer: training engines (Megatron, DeepSpeed) and inference engines (SGLang, vLLM) use completely different parallelism strategies and tensor layouts. Awex abstracts this behind a unified conversion layer and deterministic P2P transfer plans. It supports both separated-GPU and co-located (CUDA IPC) modes.

  • Mooncake Transfer Engine: SGLang has moved toward integrating the Mooncake Transfer Engine as its high-performance transport layer, with integrations spanning PD disaggregation, hierarchical KV caching, and elastic expert parallelism. For weight sync specifically, the companion checkpoint-engine project uses Mooncake’s RDMA-backed P2P transfers to update trillion-parameter models (Kimi-K2, 256×H20 GPUs) in ~16–17 seconds. Mooncake is now part of the PyTorch Ecosystem and also serves as a backend plugin for NVIDIA’s NIXL transfer library.



3. Partial Rollout Support for Agentic Workloads

Multi-turn tool-use tasks in complex environments can take minutes per rollout. Without a mechanism to handle in-flight rollouts during weight updates, sync windows become pipeline stalls. We’ll likely explore two strategies experimentally:

  • Prefix-resume: when weights update mid-rollout, save the KV cache prefix and resume generation from the checkpoint under the new policy. This preserves partial work but requires support from the inference engine for mid-sequence weight swaps.
  • Abort-and-retry: discard in-flight rollouts that exceed a staleness threshold and re-queue the prompt. Simpler to implement, but wastes compute proportional to the average rollout length at the time of abort.

That’s the map. Stay tuned: we’re working on a concrete async GRPO trainer in TRL and will announce it shortly 🧑‍🍳!


