When training large language models (LLMs) with reinforcement learning from verifiable rewards (RLVR), one of the most compelling questions is how to overcome performance plateaus. The previous NVIDIA Research solution, Prolonged Reinforcement Learning (ProRL), showed that adding more reinforcement learning (RL) steps during prolonged training could expand the reasoning boundaries of LLMs.
But eventually, the team hit a wall. After hundreds of steps, performance gains diminished, the model’s improvement stagnated, and it even began to degrade. For more details, see Scaling LLM Reinforcement Learning with Prolonged Training Using ProRL v2.
This raises a critical question: Is this plateau a fundamental limit of RL, or is it an artifact of how scaling is performed?
Today, we’re excited to introduce Broadened Reinforcement Learning (BroRL), a new paradigm that explores a complementary and powerful scaling dimension: rollout scaling. Instead of just training for more steps, BroRL dramatically increases the number of exploratory rollouts per prompt to the order of hundreds. This approach breaks through the performance ceiling where other methods stall, and proves to be significantly more data- and compute-efficient. We’re releasing our state-of-the-art 1.5B model trained with BroRL.
This post dives into the core theoretical insights, recent empirical results, and why scaling rollouts is the key to unlocking the next level of reasoning in LLMs.
How does BroRL enable continuous learning?
Most RL scaling efforts focus on training length. This often leads to an unstable learning signal, where the model struggles to escape its existing knowledge base. The perceived limits of RL are often just the limits of its exploration strategy.
BroRL challenges this paradigm by focusing on rollout scaling for exploration at each update step. The goal is to move beyond incremental gains by fundamentally stabilizing the RL process, enabling continuous learning where it previously stalled. The comparison table below summarizes the contrast, and the sketch after it shows what rollout scaling changes in practice.
| Step scaling (for example, ProRL) | Rollout scaling (BroRL) |
| --- | --- |
| Scales with more training steps (3,000+) | Scales with more rollouts per prompt (N=512) |
| Hits a performance plateau; diminishing returns | Breaks the plateau; robust, continuous improvement |
| Learning signal can be unstable and noisy | Stable, high-quality updates from exhaustive exploration |
| Becomes inefficient at the saturation point | More compute- and data-efficient |
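To make the contrast concrete, here is a minimal sketch of the one knob rollout scaling turns at generation time: the number of completions sampled per prompt at each update step. It assumes a vLLM-style offline generation API (`SamplingParams.n`) and a placeholder model ID; the actual BroRL training stack is not described in this post.

```python
# Minimal sketch: rollout scaling changes how many completions are sampled
# per prompt at each RL update step, not how many steps are taken.
# Illustrative only; assumes a vLLM-style API, not the actual BroRL stack.
from vllm import LLM, SamplingParams

prompts = ["Prove that the sum of two even integers is even."]  # hypothetical prompt

# Step scaling (ProRL-style): few rollouts per prompt, many training steps.
prorl_params = SamplingParams(n=16, temperature=1.0, max_tokens=4096)

# Rollout scaling (BroRL): broaden exploration to hundreds of rollouts per prompt.
brorl_params = SamplingParams(n=512, temperature=1.0, max_tokens=4096)

llm = LLM(model="path/or/hf-id-of-your-policy-model")  # placeholder, not the released checkpoint ID
outputs = llm.generate(prompts, brorl_params)
rollouts = [completion.text for completion in outputs[0].outputs]
print(f"Collected {len(rollouts)} rollouts for one prompt")  # -> 512
```

Because all N completions share the same prompt prefix, pushing N into the hundreds also keeps the generator busy with parallel work, which connects to the hardware-efficiency results discussed later in this post.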
How does rollout scaling control RL instability?
As detailed in BroRL: Scaling Reinforcement Learning via Broadened Exploration, our theoretical analysis (Section 2) reveals that the RL update process is governed by two competing forces: the sampled rollouts and the unsampled space.
As an analogy, think of it like exploring a vast, foggy landscape to find the highest peak. The paths you actually walk (sampled rollouts) provide reliable, positive feedback, helping you gain altitude. Yet the countless paths you don’t take (the unsampled space) create uncertainty and noise. This noise acts like a gravitational pull, dragging you back down the hill. When you only send out a few scouts (N=16 in ProRL), their reports are noisy, and this downward pull can be strong enough to halt your ascent, leaving you stuck on a plateau.
The BroRL solution is simple but powerful: send out an entire army of scouts (N=512). By mapping a vast portion of the landscape, the random noise from the unexplored fog averages out, and the “upward signal” from all the successful paths becomes overwhelmingly strong.
In our formal analysis, this means the net change in the model’s performance becomes positive when N is large. This provides a stable, high-quality learning signal that allows the model to climb past the plateau.
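For intuition, the toy simulation below (an illustration of the analogy above, not the paper’s formal result) models a single hard prompt where the current policy succeeds 20% of the time, and compares the group-relative learning signal estimated from N=16 versus N=512 rollouts. The expected signal is the same, but its variance shrinks sharply and zero-signal steps essentially disappear at large N.

```python
# Toy Monte Carlo illustration: larger rollout counts give the same expected
# learning signal with far less noise. Assumes a binary verifiable reward and
# a GRPO-style group-relative baseline; not the paper's formal analysis.
import numpy as np

rng = np.random.default_rng(0)
p_correct = 0.2   # assumed success rate of the current policy on a hard prompt
trials = 5000     # Monte Carlo repetitions per rollout budget

for n in (16, 512):
    signals = []
    for _ in range(trials):
        rewards = rng.binomial(1, p_correct, size=n).astype(float)
        advantages = rewards - rewards.mean()          # group-relative baseline
        # Normalized push toward correct completions in this update step.
        signals.append(advantages[rewards == 1.0].sum() / n)
    signals = np.array(signals)
    print(f"N={n:3d}: mean signal={signals.mean():.3f}, "
          f"std={signals.std():.3f}, zero-signal steps={(signals == 0).mean():.1%}")
```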
Breaking through the RL performance plateau
We applied the BroRL recipe to a strong ProRLv2 model that had already plateaued after 3,000 training steps. The results were definitive.
Figure 1 tells a clear story. While continuing with the ProRL recipe (blue line) leads to stagnation and eventual degradation, BroRL (orange line) revives the model, enabling robust and continuous performance gains that break through the previous ceiling.


BroRL comprehensive results
We continued training the 3,000-step ProRLv2 checkpoint with both the original recipe (N=16) and the new BroRL recipe (N=512) on 64 NVIDIA H100 GPUs. The divergence was clear: ProRL stagnated, while BroRL delivered steady, significant gains in less time.
| Method | N | RL steps | Total time (h) | Math score | Code score | Reasoning Gym score |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 16 | 2,000 | – | 60.14 | 51.43 | 59.06 |
| Baseline | 16 | 3,000 | – | 61.69 | 52.00 | 61.29 |
| ProRL | 16 | 3,000+225 | +56.3 | 62.08 | 52.26 | 62.10 |
| ProRL | 16 | 3,000+535 | +133.8 | 62.02 (stagnated) | 52.74 | 61.45 (degraded) |
| BroRL | 512 | 3,000+107 | +98.1 | 62.62 | 53.31 | 62.71 |
| BroRL | 512 | 3,000+134 | +122.8 | 62.85 | 53.48 | 62.82 |
| BroRL | 512 | 3,000+419 | +393.9 | 63.66 | 56.64 | 63.40 |
After just 98.1 hours, BroRL had already decisively surpassed the final performance of the ProRL method across all metrics, doing so in roughly 35 fewer hours. This confirms that scaling rollout size is a more effective and computationally efficient strategy for pushing the boundaries of a saturated model.
BroRL sets a new state of the art for 1.5B reasoning models, achieving the highest scores on the Math (63.66), Code (56.64), and Reasoning Gym (63.40) benchmarks.
Superior compute efficiency
BroRL isn’t just better; it’s faster and smarter with its compute.
- Algorithmic efficiency: Large-N rollouts produce a more diverse set of candidate samples. The pass rate for dynamic sampling, which filters out uninformative trajectories, jumped from 41% to 62%, meaning less computation was wasted (a minimal sketch of this filter follows the table below).
- Hardware efficiency: BroRL shifts the generation process from being memory-bound to compute-bound and improves the prefix cache hit rate. Consequently, the GPU can fully utilize its parallel processing power, nearly doubling the throughput from 36.5 to 72.4 samples/s in our hardware setup.
| Method (N) | Dynamic sampling pass rate | Generation throughput (samples/s) |
| --- | --- | --- |
| ProRL (16) | 41% | 36.5 |
| BroRL (512) | 62% | 72.4 |
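Below is a minimal sketch of how a dynamic sampling filter of this kind works (our reading of the technique, not the exact BroRL implementation): a prompt only “passes” when its rollout group contains both correct and incorrect answers, because a group with uniform rewards yields all-zero group-relative advantages and wastes the compute spent generating it.

```python
# Minimal sketch of a dynamic sampling filter: keep only prompts whose rollout
# groups have mixed outcomes, since uniform rewards carry no learning signal.
# Hypothetical schema; not the exact BroRL implementation.
from typing import Dict, List

def passes_dynamic_sampling(rewards: List[float]) -> bool:
    """A prompt is kept only if its rollout rewards are not all identical."""
    return len(set(rewards)) > 1

def filter_batch(batch: List[Dict]) -> List[Dict]:
    """batch items use a hypothetical schema: {"prompt": ..., "rewards": [...]}."""
    return [item for item in batch if passes_dynamic_sampling(item["rewards"])]

# Tiny hypothetical batch: only the mixed-outcome prompt survives the filter.
batch = [
    {"prompt": "p1", "rewards": [0.0] * 16},           # all wrong -> filtered out
    {"prompt": "p2", "rewards": [1.0] * 15 + [0.0]},   # mixed -> kept
]
print([item["prompt"] for item in filter_batch(batch)])  # ['p2']
```

Combining the two effects in the table, the useful throughput improves even more than either number alone suggests: roughly 0.41 × 36.5 ≈ 15 accepted samples/s for ProRL versus 0.62 × 72.4 ≈ 45 for BroRL.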
Greater token efficiency
BroRL delivers better accuracy with fewer output tokens on both the Math and Code benchmarks, indicating higher score-per-token efficiency and tighter, less redundant reasoning.
Large-N rollout exploration (N=512) surfaces many concise, high-yield trajectories per prompt, which both raises the chance of sampling compact correct chains and reduces reliance on verbose, low-signal reasoning. This decouples quality from response length, whereas step scaling typically inflates token counts. The table and the back-of-the-envelope calculation that follow it quantify the gap.
| Task | ProRL score | BroRL score | Score diff | ProRL tokens | BroRL tokens | Token diff |
| --- | --- | --- | --- | --- | --- | --- |
| Math | 62.02 | 63.66 | +1.64 | 16,506 | 15,760 | -745 |
| Code | 52.74 | 56.64 | +3.90 | 26,808 | 26,090 | -717 |
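As a rough check on the score-per-token claim, the snippet below divides the scores by the token counts from the table above (illustrative arithmetic only, using the rounded values reported here).

```python
# Scores per 1,000 output tokens, computed from the table above.
rows = {
    "Math": {"ProRL": (62.02, 16_506), "BroRL": (63.66, 15_760)},
    "Code": {"ProRL": (52.74, 26_808), "BroRL": (56.64, 26_090)},
}
for task, methods in rows.items():
    for method, (score, tokens) in methods.items():
        print(f"{task} {method}: {1000 * score / tokens:.2f} points per 1K tokens")
# Math: ProRL ~3.76 vs BroRL ~4.04; Code: ProRL ~1.97 vs BroRL ~2.17.
```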
Get started with BroRL
Our findings establish rollout size not only as a hyperparameter, but as a critical and efficient axis for scaling reinforcement learning. The performance plateaus encountered by step-scaling methods are not fundamental limits of RL but artifacts of insufficient exploration. Key insights and takeaways include:
- Rollout scaling is a new, crucial scaling dimension for RL. It provides a stable learning signal where depth scaling alone fails.
- Performance plateaus are not dead ends. They can be overcome by scaling rollouts to generate higher-quality policy updates.
- BroRL is more computationally efficient, doubling hardware throughput and improving algorithmic sample efficiency.
- BroRL is more token efficient, achieving more with less.
- The new BroRL-trained checkpoint sets a new state of the art for 1.5B reasoning models.
For those looking to maximize the potential of their models with RL, BroRL provides a principled path forward: when you hit a wall, don’t just push forward. Go wider.
To get started, explore and evaluate the BroRL model, available through Hugging Face.
Acknowledgments
Thanks to Yejin Choi, Fang Wu, Zaid Harchaoui, Pavlo Molchanov, Jan Kautz, and Jun Yang for their contributions to this post.
