We’re releasing TRL v1.0, and it marks a real shift in what TRL is. What began as a research codebase has become a reliable library that people build on, with clearer expectations around stability.
This is not just a version bump. It reflects the fact that TRL now powers production systems, and embraces that responsibility.
TRL now implements more than 75 post-training methods. But coverage isn’t the goal in itself. What matters is making these methods easy to try, compare, and actually use in practice.
The design of the library wasn’t decided upfront. It’s the result of years of iteration — the first commit goes back more than six years — and it has been shaped by everything the field threw at it: new algorithms, new models, shifting paradigms. Over time, this pressure forced the codebase toward a very specific design. Parts of it might look unusual at first, but as in many evolutionary codebases, they exist for a reason.
TRL is built for a field that doesn’t sit still. So the question is not how to design the perfect abstraction. It’s how to build stable software in a domain that keeps invalidating its own assumptions. That is what we tried to solve in TRL v1.0, and this post explains how.
1. A moving target: post-training as a shifting field
Post-training has not evolved as a smooth refinement of one recipe. It has moved through successive centers of gravity, each changing not only the objective but the shape of the stack.
PPO [Schulman et al., (2017); Ziegler et al., (2019)] made one architecture look canonical: a policy, a reference model, a learned reward model, sampled rollouts, and an RL loop.
Then DPO-style methods such as the original DPO [Rafailov et al., (2023)], ORPO [Hong et al., (2024)], and KTO [Ethayarajh et al., (2024)] cut through that stack: preference optimization could work with no separate reward model, value model, or any online RL. Components that had looked fundamental suddenly looked optional.
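To see why no separate reward model is needed, DPO’s objective (from Rafailov et al., 2023) can be written directly in terms of the policy and a frozen reference model, with the scaled log-probability ratio acting as an implicit reward:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Optimizing this is plain supervised-style training on preference pairs $(x, y_w, y_l)$: no reward model, no value model, no rollouts.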
RLVR-style methods such as GRPO [Shao et al., (2024)] shifted the center again. On tasks like math, code, and tool use, rewards often come from verifiers or deterministic checks rather than learned reward models. Sampling and rollouts matter again, but the objects in the loop are no longer quite the ones PPO libraries were designed around.
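To make “verifiers or deterministic checks” concrete, here is a minimal, hypothetical sketch of a verifier-style reward: a plain function that checks an answer instead of scoring it with a learned model. The function name and the matching rule are illustrative assumptions, not part of any library API.

```python
import re

def math_verifier_reward(completion: str, answer: str) -> float:
    """Hypothetical deterministic verifier: reward 1.0 if the last number
    in the completion matches the ground-truth answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == answer else 0.0

print(math_verifier_reward("The result is 42.", "42"))  # 1.0
print(math_verifier_reward("I think it's 41.", "42"))   # 0.0
```

Because such a check is just a function, there is nothing to train and nothing to keep in sync with the policy, which is exactly what made the learned-reward-model stage optional in this family of methods.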
The lesson is not just that methods change. The definition of the core keeps changing with them. Strong assumptions here have a short half-life. This may be why no post-training library is truly stable yet.
2. From project to library: TRL has a chaos-adaptive design
So what does it mean to build a library for a field that will not sit still? The answer is counterintuitive: don’t try to capture the essence of what is stable today. Instead, design around what could change. Reward models illustrate why: they looked essential in PPO, became optional in DPO, and came back as verifiers in RLVR methods — structures that could be deterministic functions rather than learned models. Any abstraction built around their original form would have been obsolete twice over by now. The library survives by recognizing that strong assumptions have a short life, and by making that changeability central to how the codebase is organized.
This is the environment in which TRL is downloaded 3 million times a month, and in which major downstream projects treat it as stable infrastructure. The field keeps shifting the ground, and at the same time, those users need things not to break.
A shift in nature: from code to contract
TRL didn’t make a deliberate decision to become a library. It found out it already was one. Projects like Unsloth and Axolotl — with hundreds of users between them — had built directly on top of TRL’s trainers and APIs. A breaking change in TRL propagated immediately into their stacks. A renamed argument, a shifted default, a restructured output — any of those became someone else’s incident. The shift had already happened. v1.0 is the moment TRL acknowledged it explicitly.
Stable and experimental, under the same roof
The unusual thing about TRL’s stability model is not what it guarantees, it’s what it tolerates alongside those guarantees. Stable and experimental coexist inside the same package, with explicitly different contracts. The stable core follows semantic versioning. The experimental layer makes no such guarantees — it’s where new methods land while they’re still being evaluated, and where the API can move fast to keep up with the field.
This isn’t a compromise. It’s a response to a specific constraint: the field produces new methods faster than any of them can earn stability. Refusing to add immature methods would make TRL irrelevant within months. Adding them all to stable would break every downstream project each time an algorithm turned out not to work as expected.
```python
from trl import SFTTrainer                     # stable: semantic versioning applies
from trl.experimental.orpo import ORPOTrainer  # experimental: may change between releases
```
Promotion from experimental to stable isn’t automatic. What matters is the ratio between maintenance cost and actual usage. Some methods earn their place because the community uses them heavily. Others become viable because we can make them cheap enough to maintain — and the design of the codebase is what makes that possible.
In practice, the stable surface includes trainers for SFT, DPO, reward modeling, RLOO, and GRPO, together with their close variants. The experimental surface is broader and moves faster; for an up-to-date view, the best reference is the TRL documentation.
The breaking changes needed to reach v1.0 were distributed deliberately across the 0.x releases. Migration from the last 0.x version is minimal — see the migration guide.
Deliberately limiting abstractions
In a domain where patterns keep changing, the temptation is to build flexible abstractions that can accommodate anything. Our answer was the opposite: limit abstractions to the strict minimum — while recognizing that this “minimum” is almost always overestimated.
In practice, this translates into a very local approach to code:
- avoid generic class hierarchies
- favor explicit implementations
- accept, and even encourage, duplication
The goal is not to eliminate structure altogether — shared utilities still exist — but to avoid imposing abstractions where the domain itself is not yet stable. For example, rather than defining a common base class for offline trainers, we prefer independent implementations when their future evolution is uncertain.
Rather than:

```python
class OfflineTrainer(Trainer):
    def some_common_method(self): ...

class DPOTrainer(OfflineTrainer): ...

class KTOTrainer(OfflineTrainer): ...
```

we write:

```python
class DPOTrainer(Trainer):
    def some_common_method(self): ...

class KTOTrainer(Trainer):
    def some_common_method(self): ...
```
Another example — rather than one generic collator serving every trainer:
```python
class TRLCollator: ...

class DPOTrainer:
    def __init__(self, *args, **kwargs):
        self.collator = TRLCollator(...)

class KTOTrainer:
    def __init__(self, *args, **kwargs):
        self.collator = TRLCollator(...)
```

we write a purpose-built collator per data format:

```python
class DataCollatorForPreference: ...

class DPOTrainer:
    def __init__(self, *args, **kwargs):
        self.collator = DataCollatorForPreference(...)

class DataCollatorForUnpairedPreference: ...

class KTOTrainer:
    def __init__(self, *args, **kwargs):
        self.collator = DataCollatorForUnpairedPreference(...)
```
Judges are a good example of what happens when we don’t follow this principle. Early on, we introduced a Judge abstraction to unify the various ways of evaluating model outputs. It looked reasonable at the time. In practice, it was never really used — the abstraction didn’t match how people actually approached evaluation, and it added indirection without adding value. It still lives in the repo, mostly as legacy. In hindsight, shipping the concrete implementations without the unifying abstraction would have served users better.
More explicit, but more adaptable
This approach favors explicit and modifiable usage over rigid frameworks: less magic, but more control. It comes with an obvious cost: code duplication. While often seen as an anti-pattern, in this context it has proven not only acceptable, but effective. Contrary to intuition, it stays manageable in practice with a small but consistent discipline: keeping deltas between implementations minimal and avoiding unnecessary divergence. As in the Transformers design philosophy, we accept duplication and local explicitness by design. The motivations largely coincide, with some nuance in focus.
This is easier to see than to explain. Compare RLOO and GRPO: large parts of their implementation are duplicated almost line for line. That is not accidental, and it is not dead weight. These methods are close enough that keeping their code paths aligned makes them easier to read, easier to evolve, and cheaper to maintain.
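As an illustration of how small those deltas can be, here is a sketch (not TRL’s actual code) of the advantage computations at the heart of the two methods: GRPO normalizes each reward against its group’s mean and standard deviation, while RLOO subtracts a leave-one-out baseline. The two functions differ by only a couple of lines.

```python
from statistics import mean, stdev

def grpo_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Group-relative normalization over rewards of one prompt's completions."""
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out baseline: each reward minus the mean of the others."""
    total, k = sum(rewards), len(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

print(rloo_advantages([1.0, 2.0, 3.0]))  # [-1.5, 0.0, 1.5]
```

Keeping such near-duplicates textually aligned is what makes a change in one easy to mirror in the other.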
3. Where TRL suits
The goal of this comparison is not to argue that TRL should be judged as best on every axis. It should not. Some systems are built for maximum throughput (like PipelineRL), some are optimized for a narrower slice of the problem (like LLaMA-Factory), and some offer a more opinionated development experience in a specific environment (like Tinker). TRL occupies a different place in the ecosystem: it’s a general-purpose post-training library that tries to keep the API and the code as simple as the domain allows, while combining broad method coverage, deep Hugging Face integration, a relatively low infrastructure burden, and an explicit stability contract.
Libraries like Unsloth and Axolotl aren’t included here because they build on top of TRL rather than sitting beside it in the comparison; in that sense, many of their users are also TRL users indirectly.
Ecosystem
| | TRL | OpenRLHF | veRL | PRIME-RL | PipelineRL | OAT | Tinker | LLaMA-Factory | torchtune |
|---|---|---|---|---|---|---|---|---|---|
| Hugging Face Hub integration | 🟢 Full | 🟡 No push | 🟡 Model loading only | 🟡 No push | 🟡 No push | 🟡 No push | 🟡 No dataset loading | 🟢 Full | 🟡 Dataset loading only |
| PEFT / LoRA / QLoRA support | 🟢 LoRA + QLoRA | 🟢 LoRA + QLoRA | 🟡 LoRA only (QAT instead of QLoRA) | 🟡 LoRA only | 🔴 Not supported | 🟢 LoRA + QLoRA | 🟡 LoRA only | 🟢 LoRA + QLoRA | 🟢 LoRA + QLoRA (torchao, not bitsandbytes) |
| Experiment tracker flexibility | 🟢 Any (via report_to) | 🟡 wandb + tensorboard | 🟢 wandb, mlflow, swanlab, tensorboard | 🔴 wandb only | 🔴 wandb only | 🔴 wandb only | 🟡 DIY (metrics returned via API, no built-in tracker) | 🟢 Any (via report_to) + swanlab | 🟡 wandb + tensorboard (manual config) |
| Infrastructure burden | 🟢 Low (single GPU, standard stack) | 🟠 High (Ray required) | 🔴 Very high (Ray + rollout engine) | 🟠 High (separate vLLM server + ZMQ) | 🟠 High (async vLLM pipeline) | 🟡 Medium (vLLM needed for RL) | 🟢 Low (managed cloud service) | 🟢 Low (single script) | 🟢 Low (single script) |
Training methods
| | TRL | OpenRLHF | veRL | PRIME-RL | PipelineRL | OAT | Tinker | LLaMA-Factory | torchtune |
|---|---|---|---|---|---|---|---|---|---|
| VLM support | 🟢 Yes (SFT, DPO, GRPO) | 🔴 No | 🟢 Yes (in SFT + PPO trainers) | 🟡 Partial (Qwen3-VL only) | 🟢 Yes (via processor_factory) | 🔴 No | 🔴 No | 🟢 Yes (via mm_plugin) | 🟡 Partial (Llama Vision only) |
| Supervised post-training | 🟢 Yes (via SFTTrainer) | 🟢 Yes (via SFTTrainer) | 🟢 Yes (via SFTTrainer) | 🟢 Yes (via SFT entrypoint) | 🔴 No | 🟢 Yes (via SFTLearner) | 🔴 No (low-level primitives only) | 🟢 Yes (via SFT trainer) | 🟢 Yes (via finetune recipes) |
| Distillation post-training | 🟢 Yes (GKD, SDFT, SDPO) | 🔴 No | 🟢 Yes (dedicated distillation trainer) | 🟢 Yes (on-policy distillation) | 🔴 No | 🔴 No | 🔴 No (low-level primitives only) | 🔴 No | 🟢 Yes (native KD recipes) |
| Preference post-training | 🟢 Yes (DPO, KTO, ORPO, CPO, SimPO, IPO, …) | 🟡 DPO only | 🔴 No | 🔴 No | 🔴 No | 🟢 Yes (DPO, SimPO, IPO, XPO) | 🔴 No (low-level primitives only) | 🟡 DPO, KTO, ORPO (via TRL) | 🟡 DPO only |
| RL post-training | 🟢 Yes (PPO, GRPO, RLOO, …) | 🟢 Yes (PPO, REINFORCE++, GRPO, RLOO) | 🟢 Yes (PPO, GRPO, RLOO, REINFORCE++, DAPO, PRIME, …) | 🟢 Yes (async GRPO-style) | 🟢 Yes (GRPO, async) | 🟢 Yes (PPO, GRPO, Online DPO) | 🔴 No (low-level primitives only) | 🟠 PPO only (via TRL) | 🟠 PPO (GRPO in development) |
| Agent / environment support | 🟢 Yes (flexible environment_factory in GRPO) | 🟢 Yes (flexible AgentInstance interface) | 🟢 Yes (flexible BaseInteraction interface) | 🟡 Partial (tied to Prime Intellect’s Environments Hub) | 🟡 Partial (built-in domains: fn_calling, miniwob, …) | 🔴 No | 🔴 No (low-level primitives only) | 🔴 No | 🔴 No |
| Scalability | 🟡 Multi-node DeepSpeed/FSDP, no native TP | 🟢 Ray + DeepSpeed + vLLM TP at scale | 🟢 Megatron 3D parallelism, tested at 671B | 🟢 FSDP2 + TP/CP/EP, SLURM & K8s | 🟡 DeepSpeed/FSDP, no native training TP | 🔴 DeepSpeed ZeRO only, research-scale | 🟢 Managed service, handles large models | 🟡 FSDP2 + DeepSpeed, no native TP | 🟡 FSDP + PyTorch TP, no PP |
Project health
| | TRL | OpenRLHF | veRL | PRIME-RL | PipelineRL | OAT | Tinker | LLaMA-Factory | torchtune |
|---|---|---|---|---|---|---|---|---|---|
| Semver stability | 🟢 Yes | 🟡 Partly | 🟡 Partly | 🔴 No | 🔴 No | 🔴 No | 🟡 Partly | 🟡 Partly | 🟡 Partly |
| PyPI downloads / month | 3.0M | 3.6k | 101.6k | N/A | N/A | 218 | 363.3k | 25.6k | 370.9k |
| GitHub stars | 17.8k | 9.2k | 20.2k | 1.2k | 0.4k | 0.6k | 3.0k | 69.0k | 5.7k |
| Last release | 🟢 today | 🟢 yesterday | 🟡 1 week ago | 🟠 7 weeks ago | ⚫ No release | 🔴 3 months ago | ⚫ No release | 🔴 3 months ago | 🔴 11 months ago |
| Last commit | 🟢 yesterday | 🟢 yesterday | 🟢 today | 🟢 today | 🟡 3 weeks ago | 🟠 8 weeks ago | 🟢 yesterday | 🟢 yesterday | 🟡 1 week ago |
Some rows are factual (GitHub stars, Last release, Last commit), others are qualitative judgments (Semver stability).
Taken together, these comparisons point to a clear role for TRL: a general-purpose library designed to keep breadth, simplicity, integration, and stability in the same place. Its full downstream footprint is hard to measure, since most deployments are private and reverse dependencies are largely invisible, but the available signals already show that TRL operates at a distinctly different scale.
4. What’s next
By now, the logic of v1.0 should be clear: it is not a claim that post-training has stabilized. On the contrary, it’s an acknowledgment that the field will keep shifting, and that we’re confident the library has the right shape to absorb whatever comes next. The question is not what comes after v1.0, but what’s next for v1.0.
Asynchronous GRPO
Today, GRPO in TRL is primarily used through a synchronous loop: generate rollouts, score them, then step the optimizer. That shape is simple and reliable, but it ties throughput to the slowest stage and leaves performance on the table at scale.
The fix is conceptually simple: generation and training don’t have to be lock-stepped. We already have an early asynchronous GRPO design, and the next step is to harden it. The core idea is to decouple generation and training, letting generation run continuously on dedicated inference resources while training consumes a steady stream of scored trajectories, with buffering, backpressure, and clear policy-version accounting. This improves utilization and scales across GPUs and nodes. Other libraries already offer forms of asynchronous RL, but bringing it to TRL would make this kind of training available through broader integrations, simpler APIs, and a much lower barrier to adoption.
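The decoupling idea can be sketched in a few lines. This is illustrative only, not TRL’s implementation: the bounded queue provides backpressure (generation blocks when training lags), and tagging each batch with the policy version that produced it is the hook for staleness accounting. The batch contents and the staleness threshold are assumptions.

```python
import queue
import threading

buffer = queue.Queue(maxsize=8)  # bounded queue => backpressure on the producer
policy_version = 0               # bumped by the trainer after each step
MAX_STALENESS = 2

def generation_loop(num_batches):
    for _ in range(num_batches):
        rollouts = ["<sampled completion>"]  # stand-in for inference-engine output
        rewards = [1.0]                      # stand-in for verifier scores
        buffer.put((policy_version, rollouts, rewards))  # blocks when buffer is full

def training_loop(num_batches):
    global policy_version
    stale = 0
    for _ in range(num_batches):
        version, rollouts, rewards = buffer.get()
        if policy_version - version > MAX_STALENESS:
            stale += 1  # a real system would drop or importance-correct these
        # ... optimizer step on (rollouts, rewards) would go here ...
        policy_version += 1
    return stale

producer = threading.Thread(target=generation_loop, args=(16,))
producer.start()
training_loop(16)
producer.join()
print("final policy version:", policy_version)  # 16
```

In a real system the producer would be a dedicated inference server and the queue a distributed buffer, but the contract is the same: generation never waits for the optimizer except when the buffer is full.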
Graduating methods to stable
The next candidates include KTO and newer distillation trainers such as SDFT, SDPO, and possibly GOLD or GKD. As discussed in Section 2, before moving them to stable, the goal is to minimize code differences across implementations and monitor sustained community interest relative to maintenance cost.
Scaling
TRL supports large-scale training, including multi-node runs and larger models; the next step is to make this path significantly more robust and easier to operate in production. That includes stronger guarantees around distributed stability, clearer scaling defaults, and deeper support for Mixture-of-Experts (MoE), especially expert parallelism, where routing, load balancing, and memory behavior become critical.
Making training legible to agents
Training is still too often driven by vibes. Loss curves go down, reward curves go up, a few samples look better than before, and people convince themselves the run is working. When it fails, they scroll through logs, compare runs by eye, and guess. That’s already a weak interface for humans. For agents, it’s worse: it’s barely an interface at all.
One of the most important directions for TRL is to make training legible to software, not only to people. That means going beyond dashboards and raw metrics to produce explicit signals: is the policy improving, collapsing, over-optimizing the verifier, drifting off-distribution, or plateauing? The goal is for TRL to surface these patterns automatically, explain them clearly, and turn them into actions.
The plan is to embed heuristics directly into the training loop and emit structured, actionable warnings — the kind a beginner can act on immediately and an agent can parse:
```
[TRL] WARNING: VRAM utilization at 34%. Consider increasing per_device_train_batch_size from 4 to 16.
...
[TRL] WARNING: Group reward std is 0.01 (near zero). Advantage signal has collapsed. Consider revisiting your reward function to ensure it provides sufficient variance for learning.
...
[TRL] WARNING: Clip ratio outside [0.8, 1.2] for 43% of updates. Consider reducing the learning rate.
```
Not just logging what happened — reasoning about what it means and what to do next. Useful for beginners who need guardrails, and for agents that need a training stack they can actually automate.
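A sketch of what one such heuristic could look like inside the loop (the threshold, function name, and message wording are assumptions for illustration, not TRL’s actual implementation):

```python
import statistics

def check_reward_spread(group_rewards, threshold=0.05):
    """Detect a collapsed advantage signal in a GRPO-style trainer by
    checking the spread of rewards within one prompt's group. Returns a
    structured warning string, or None if the signal looks healthy."""
    std = statistics.pstdev(group_rewards)
    if std < threshold:
        return (
            f"[TRL] WARNING: Group reward std is {std:.2f} (near zero). "
            "Advantage signal has collapsed. Consider revisiting your reward "
            "function to ensure it provides sufficient variance for learning."
        )
    return None

print(check_reward_spread([1.0, 1.0, 1.0, 1.0]))  # fires: all rewards identical
print(check_reward_spread([0.0, 1.0, 0.0, 1.0]))  # None: healthy spread
```

The point is that the check consumes quantities the trainer already computes, so emitting the warning costs nothing extra per step.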
5. Conclusion
Post-training doesn’t converge. It shifts, and the next shift is already coming.
v1.0 is not a claim that things have settled. It’s an acknowledgment that they haven’t yet, and a commitment that the library can be relied on anyway. Six years of evolving alongside the field — and alongside the hundreds of contributors who made it possible — shaped a design we’re confident is ready for what comes next, whatever that turns out to be. The community and the downstream projects had already assumed that stability — v1.0 makes it real.
```
pip install --upgrade trl
```
Migration from the last 0.x release is minimal, and the migration guide covers everything. If you’re new, now is a great time to start.

