Breaking Through Reinforcement Learning Training Limits with Scaling Rollouts in BroRL



When training large language models (LLMs) with reinforcement learning from verifiable rewards (RLVR), one of the most compelling questions is how to overcome performance plateaus. The previous NVIDIA Research solution, Prolonged Reinforcement Learning (ProRL), showed that adding more reinforcement learning (RL) steps during prolonged training could expand the reasoning boundaries of LLMs.

But eventually, the team hit a wall. After hundreds of steps, performance gains diminished, the model’s improvement stagnated, and it even began to degrade. For more details, see Scaling LLM Reinforcement Learning with Prolonged Training Using ProRL v2.

This raises a critical question: Is this plateau a fundamental limit of RL, or is it an artifact of how scaling is performed?

Today, we’re excited to introduce Broadened Reinforcement Learning (BroRL), a new paradigm that explores a complementary and powerful scaling dimension: rollout scaling. Instead of just training for more steps, BroRL dramatically increases the number of exploratory rollouts per prompt to the order of hundreds. This approach breaks through the performance ceiling where other methods stall and proves to be significantly more data- and compute-efficient. We’re releasing our state-of-the-art 1.5B model trained with BroRL.

This post dives into the core theoretical insights, new empirical results, and why scaling rollouts is the key to unlocking the next level of reasoning in LLMs.

How does BroRL enable continuous learning?

Most RL scaling efforts focus on training length. This often results in an unstable learning signal, where the model struggles to escape its existing knowledge base. The perceived limits of RL are often just the limits of its exploration strategy.

BroRL challenges this paradigm by focusing on rollout scaling for exploration at each update step. The goal is to move beyond incremental gains by fundamentally stabilizing the RL process, enabling continuous learning where it previously stalled.

| Step scaling (ProRL, for example) | Rollout scaling (BroRL) |
| --- | --- |
| Scales with more training steps (3,000+) | Scales with more rollouts per prompt (N=512) |
| Hits a performance plateau; diminishing returns | Breaks the plateau; robust, continuous improvement |
| Learning signal can be unstable and noisy | Stable, high-quality updates from exhaustive exploration |
| Becomes inefficient at the saturation point | More compute- and data-efficient |
Table 1. Core comparison of step scaling (ProRL) and rollout scaling (BroRL)
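
To make the distinction concrete, the following is a minimal sketch of what one rollout-scaled update step might look like, assuming a GRPO-style group-relative advantage. The `generate_rollouts`, `verify`, and `policy_update` helpers are hypothetical placeholders, not the released training code.

```python
import numpy as np

N_ROLLOUTS = 512  # BroRL: hundreds of rollouts per prompt (ProRL used N=16)

def rollout_scaled_step(policy, prompts):
    """One rollout-scaled RL step: broaden exploration per prompt, then update once."""
    batch = []
    for prompt in prompts:
        # Hypothetical helpers: sample N completions, then score each with a verifier.
        completions = generate_rollouts(policy, prompt, n=N_ROLLOUTS)
        rewards = np.array([verify(prompt, c) for c in completions], dtype=float)

        # GRPO-style group-relative advantage: each rollout is compared to its own group.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        batch.append((prompt, completions, advantages))

    # Hypothetical policy-gradient update over the whole broadened batch.
    policy_update(policy, batch)
```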

How does rollout scaling control RL instability?

As detailed in BroRL: Scaling Reinforcement Learning via Broadened Exploration, our theoretical analysis (Section 2) reveals that the RL update process is governed by two competing forces: sampled rollouts and unsampled space.

To give an analogy, think of it like exploring a vast, foggy landscape to find the highest peak. The paths you actually walk (sampled rollouts) provide reliable, positive feedback, helping you gain altitude. Yet the countless paths you don’t take (the unsampled space) create uncertainty and noise. This noise acts like a gravitational pull, dragging you back down the hill. When you send out only a few scouts (N=16 in ProRL), their reports are noisy, and this downward pull can be strong enough to halt your ascent, leaving you stuck on a plateau.

The BroRL solution is simple but powerful: send out a whole army of scouts (N=512). By mapping a vast portion of the landscape, the random noise from the unexplored fog averages out, and the “upward signal” from all the successful paths becomes overwhelmingly strong.

In our formal analysis, this means the net change in the model’s performance becomes positive (ΔQ_pos ≥ 0) when N is large. This provides a stable, high-quality learning signal that allows the model to climb past the plateau.
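
The toy simulation below is not the paper’s derivation, but it illustrates the same effect under simple assumptions: the positive signal from sampled paths stays roughly fixed, while the zero-mean noise attributed to the unsampled space shrinks as more rollouts are drawn, so the net update direction is far more reliably positive at N=512 than at N=16.

```python
import numpy as np

rng = np.random.default_rng(0)

def net_signal(n_rollouts, p_correct=0.05, noise_scale=1.0):
    """Toy model of one update: sampled rollouts contribute a non-negative 'uphill' term,
    while the unsampled space contributes zero-mean noise that averages out as N grows."""
    found_correct = rng.random(n_rollouts) < p_correct        # which scouts found good paths
    uphill = found_correct.mean()                             # positive signal from sampled rollouts
    fog = rng.normal(0.0, noise_scale / np.sqrt(n_rollouts))  # pull from the unexplored fog
    return uphill + fog

for n in (16, 512):
    signals = np.array([net_signal(n) for _ in range(10_000)])
    print(f"N={n:3d}: net update direction positive in {100 * (signals > 0).mean():.1f}% of trials")
```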

Breaking through the RL performance plateau

We applied the BroRL recipe to a strong ProRLv2 model that had already plateaued after 3,000 training steps. The results were definitive.

Figure 1 tells a compelling story. While continuing with the ProRL recipe (blue line) leads to stagnation and eventual degradation, BroRL (orange line) revives the model, enabling robust and continuous performance gains that break through the previous ceiling.

A line graph titled ‘Math Score Improvement Over Time’ that displays two lines representing different training methods, labeled ProRL and BroRL, against training time in hours on the x-axis.
Figure 1. BroRL (N=512) demonstrates continuous performance improvement on the Math benchmark, whereas ProRL (N=16) reaches a plateau and degrades with prolonged training

BroRL comprehensive results 

We continued training the 3,000-step ProRLv2 checkpoint using both the original recipe (N=16) and the new BroRL recipe (N=512) on 64 NVIDIA H100 GPUs. The divergence was clear: ProRL stagnated, while BroRL delivered steady, significant gains in less time.

| Method | N | RL steps | Total time (h) | Math score | Code score | Reasoning Gym score |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 16 | 2,000 | – | 60.14 | 51.43 | 59.06 |
| Baseline | 16 | 3,000 | – | 61.69 | 52.00 | 61.29 |
| ProRL | 16 | 3,000+225 | +56.3 | 62.08 | 52.26 | 62.10 |
| ProRL | 16 | 3,000+535 | +133.8 | 62.02 (stagnated) | 52.74 | 61.45 (degraded) |
| BroRL | 512 | 3,000+107 | +98.1 | 62.62 | 53.31 | 62.71 |
| BroRL | 512 | 3,000+134 | +122.8 | 62.85 | 53.48 | 62.82 |
| BroRL | 512 | 3,000+419 | +393.9 | 63.66 | 56.64 | 63.40 |
Table 2. Comprehensive performance comparison of BroRL and ProRL on key reasoning benchmarks

After just 98.1 hours, BroRL had already decisively surpassed the final performance of the ProRL run across all metrics, doing so in roughly 35 fewer hours. This confirms that scaling rollout size is a more effective and computationally efficient strategy for pushing the boundaries of a saturated model.

BroRL sets a new state of the art for 1.5B reasoning models, achieving the highest scores on the Math (63.66), Code (56.64), and Reasoning Gym (63.40) benchmarks.

Superior compute efficiency

BroRL isn’t just better; it’s also faster and smarter with its compute.

  • Algorithmic efficiency: Large-N rollouts produce a more diverse set of candidate samples. The pass rate for dynamic sampling, which filters out uninformative trajectories, jumped from 41% to 62%, meaning less computation was wasted (see the sketch after Table 3).
  • Hardware efficiency: BroRL shifts the generation process from being memory-bound to compute-bound and improves the prefix cache hit rate. Consequently, the GPU can fully utilize its parallel processing power, nearly doubling the throughput from 36.5 to 72.4 samples/s in our hardware setup.
| Method (N) | Dynamic sampling pass rate | Generation throughput (samples/s) |
| --- | --- | --- |
| ProRL (16) | 41% | 36.5 |
| BroRL (512) | 62% | 72.4 |
Table 3. Compute efficiency metrics for BroRL versus ProRL (sampling pass rate and throughput)
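
As a rough illustration of what dynamic sampling filters for (a sketch of the idea, not the exact criterion used in training), a prompt is only worth keeping if its rollout group contains both successes and failures, since an all-correct or all-wrong group produces zero group-relative advantage:

```python
def passes_dynamic_sampling(rewards):
    """Keep a prompt only if its rollouts disagree: an all-correct or all-wrong group
    yields zero group-relative advantage and therefore no learning signal."""
    return 0 < sum(rewards) < len(rewards)

# With N=512 rollouts per prompt, even hard prompts are likely to surface at least one
# success, so fewer groups are discarded (the 41% -> 62% pass rate jump in Table 3).
group_small = [0] * 16         # N=16: no correct rollout found, the prompt is wasted
group_large = [0] * 511 + [1]  # N=512: a single success keeps the prompt informative
print(passes_dynamic_sampling(group_small))  # False
print(passes_dynamic_sampling(group_large))  # True
```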

Greater token efficiency

BroRL delivers higher accuracy with fewer output tokens on both Math and Code benchmarks, indicating better score-per-token efficiency and tighter, less redundant reasoning.

Large-N rollout exploration (N=512) surfaces many concise, high-yield trajectories per prompt, which both raises the probability of sampling compact correct chains and reduces reliance on verbose, low-signal reasoning. This decouples quality from response length, whereas step scaling typically inflates token counts.

| Task | ProRL score | BroRL score | Score diff | ProRL tokens | BroRL tokens | Token diff |
| --- | --- | --- | --- | --- | --- | --- |
| Math | 62.02 | 63.66 | +1.64 | 16,506 | 15,760 | -745 |
| Code | 52.74 | 56.64 | +3.90 | 26,808 | 26,090 | -717 |
Table 4. Token efficiency comparison of BroRL and ProRL on math and code tasks
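
One way to read Table 4 is as score per 1,000 output tokens. The quick arithmetic below, using the table’s numbers directly, makes the efficiency gap explicit: roughly 3.76 versus 4.04 points per 1K tokens on Math and 1.97 versus 2.17 on Code.

```python
# Score per 1,000 output tokens, computed directly from Table 4.
results = {
    "Math": {"ProRL": (62.02, 16_506), "BroRL": (63.66, 15_760)},
    "Code": {"ProRL": (52.74, 26_808), "BroRL": (56.64, 26_090)},
}

for task, methods in results.items():
    for method, (score, tokens) in methods.items():
        print(f"{task} {method}: {1000 * score / tokens:.2f} points per 1K tokens")
```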

Get started with BroRL

Our findings establish rollout size not just as a hyperparameter, but as a critical and efficient axis for scaling reinforcement learning. The performance plateaus encountered by step-scaling methods are not fundamental limits of RL but artifacts of insufficient exploration. Key insights and takeaways include:

  • Rollout scaling is a new, crucial scaling dimension for RL. It provides a stable learning signal where depth scaling alone fails.
  • Performance plateaus are not dead ends. They can be overcome by scaling rollouts to generate higher-quality policy updates.
  • BroRL is more computationally efficient, doubling hardware throughput and improving algorithmic sample efficiency.
  • BroRL is more token efficient, achieving more with less.
  • The new BroRL-trained checkpoint sets a new state of the art for 1.5B reasoning models.

For those looking to maximize the potential of their models with RL, BroRL provides a principled path forward: when you hit a wall, don’t just push forward; go wider.

To get started, explore and evaluate the BroRL model, available through Hugging Face.
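
Below is a minimal sketch for loading and prompting the released checkpoint with the Hugging Face transformers library. The model ID shown is a placeholder; substitute the actual repository name from the BroRL release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ID -- replace with the actual BroRL 1.5B repository name on Hugging Face.
model_id = "nvidia/BroRL-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```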

Acknowledgments

Thanks to Yejin Choi, Fang Wu, Zaid Harchaoui, Pavlo Molchanov, Jan Kautz, and Jun Yang for their contributions to this post.


