Breaking Through Reinforcement Learning Training Limits with Scaling Rollouts in BroRL



When training large language models (LLMs) with reinforcement learning from verifiable rewards (RLVR), one of the most compelling questions is how to overcome performance plateaus. The previous NVIDIA Research solution, Prolonged Reinforcement Learning (ProRL), showed that adding more reinforcement learning (RL) steps during prolonged training could expand the reasoning boundaries of LLMs.

But eventually, the team hit a wall. After hundreds of steps, performance gains diminished, the model’s improvement stagnated, and it even began to degrade. For more details, see Scaling LLM Reinforcement Learning with Prolonged Training Using ProRL v2.

This raises a critical question: Is this plateau a fundamental limit of RL, or is it an artifact of how scaling is performed?

Today, we’re excited to introduce Broadened Reinforcement Learning (BroRL), a new paradigm that explores a complementary and powerful scaling dimension: rollout scaling. Instead of just training for more steps, BroRL dramatically increases the number of exploratory rollouts per prompt to the order of hundreds. This approach breaks through the performance ceiling where other methods stall and proves to be significantly more data- and compute-efficient. We’re releasing our state-of-the-art 1.5B model trained with BroRL.

This post dives into the core theoretical insights, new empirical results, and why scaling rollouts is the key to unlocking the next level of reasoning in LLMs.

How does BroRL enable continuous learning?

Most RL scaling efforts focus on training length. This often results in an unstable learning signal, where the model struggles to escape its existing knowledge base. The perceived limits of RL are often just the limits of its exploration strategy.

BroRL challenges this paradigm by focusing on rollout scaling for exploration at each update step. The goal is to move beyond incremental gains by fundamentally stabilizing the RL process, enabling continuous learning where it previously stalled.

| Step scaling (ProRL, for example) | Rollout scaling (BroRL) |
| --- | --- |
| Scales with more training steps (3,000+) | Scales with more rollouts per prompt (N=512) |
| Hits a performance plateau; diminishing returns | Breaks the plateau; robust, continuous improvement |
| Learning signal can be unstable and noisy | Stable, high-quality updates from exhaustive exploration |
| Becomes inefficient at the saturation point | More compute- and data-efficient |
Table 1. Core comparison of step scaling (ProRL) and rollout scaling (BroRL)
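
To make the distinction concrete, the following is a minimal sketch of what one rollout-scaled update step might look like, assuming a GRPO-style group-relative advantage. The `generate_rollouts`, `verify`, and `policy_update` helpers are hypothetical placeholders, not the released training code.

```python
import numpy as np

N_ROLLOUTS = 512  # BroRL: hundreds of rollouts per prompt (ProRL used N=16)

def rollout_scaled_step(policy, prompts):
    """One rollout-scaled RL step: broaden exploration per prompt, then update once."""
    batch = []
    for prompt in prompts:
        # Hypothetical helpers: sample N completions, then score each with a verifier.
        completions = generate_rollouts(policy, prompt, n=N_ROLLOUTS)
        rewards = np.array([verify(prompt, c) for c in completions], dtype=float)

        # GRPO-style group-relative advantage: each rollout is compared to its own group.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        batch.append((prompt, completions, advantages))

    # Hypothetical policy-gradient update over the whole broadened batch.
    policy_update(policy, batch)
```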

How does rollout scaling control RL instability?

As detailed in BroRL: Scaling Reinforcement Learning via Broadened Exploration, our theoretical analysis (Section 2) reveals that the RL update process is governed by two competing forces: sampled rollouts and unsampled space.

To give an analogy, think of it like exploring a vast, foggy landscape to find the highest peak. The paths you actually walk (sampled rollouts) provide reliable, positive feedback, helping you gain altitude. Yet the countless paths you don’t take (the unsampled space) create uncertainty and noise. This noise acts like a gravitational pull, dragging you back down the hill. When you send out only a few scouts (N=16 in ProRL), their reports are noisy, and this downward pull can be strong enough to halt your ascent, leaving you stuck on a plateau.

The BroRL solution is simple but powerful: send out a whole army of scouts (N=512). By mapping a vast portion of the landscape, the random noise from the unexplored fog averages out, and the “upward signal” from all the successful paths becomes overwhelmingly strong.

In our formal analysis, this means the net change in the model’s performance becomes positive (ΔQ_pos ≥ 0) when N is large. This provides a stable, high-quality learning signal that allows the model to climb past the plateau.
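
The toy simulation below is not the paper’s derivation, but it illustrates the same effect under simple assumptions: the positive signal from sampled paths stays roughly fixed, while the zero-mean noise attributed to the unsampled space shrinks as more rollouts are drawn, so the net update direction is far more reliably positive at N=512 than at N=16.

```python
import numpy as np

rng = np.random.default_rng(0)

def net_signal(n_rollouts, p_correct=0.05, noise_scale=1.0):
    """Toy model of one update: sampled rollouts contribute a non-negative 'uphill' term,
    while the unsampled space contributes zero-mean noise that averages out as N grows."""
    found_correct = rng.random(n_rollouts) < p_correct        # which scouts found good paths
    uphill = found_correct.mean()                             # positive signal from sampled rollouts
    fog = rng.normal(0.0, noise_scale / np.sqrt(n_rollouts))  # pull from the unexplored fog
    return uphill + fog

for n in (16, 512):
    signals = np.array([net_signal(n) for _ in range(10_000)])
    print(f"N={n:3d}: net update direction positive in {100 * (signals > 0).mean():.1f}% of trials")
```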

Breaking through the RL performance plateau

We applied the BroRL recipe to a strong ProRLv2 model that had already plateaued after 3,000 training steps. The results were definitive.

Figure 1 tells a compelling story. While continuing with the ProRL recipe (blue line) leads to stagnation and eventual degradation, BroRL (orange line) revives the model, enabling robust and continuous performance gains that break through the previous ceiling.

A line graph titled ‘Math Score Improvement Over Time’ that displays two lines representing different training methods, labeled ProRL and BroRL, against training time in hours on the x-axis.
Figure 1. BroRL (N=512) demonstrates continuous performance improvement on the Math benchmark, whereas ProRL (N=16) reaches a plateau and degrades with prolonged training

BroRL comprehensive results 

We continued training the 3,000-step ProRLv2 checkpoint using both the original recipe (N=16) and the new BroRL recipe (N=512) on 64 NVIDIA H100 GPUs. The divergence was clear: ProRL stagnated, while BroRL delivered steady, significant gains in less time.

| Method | N | RL steps | Total time (h) | Math score | Code score | Reasoning Gym score |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 16 | 2,000 | – | 60.14 | 51.43 | 59.06 |
| Baseline | 16 | 3,000 | – | 61.69 | 52.00 | 61.29 |
| ProRL | 16 | 3,000+225 | +56.3 | 62.08 | 52.26 | 62.10 |
| ProRL | 16 | 3,000+535 | +133.8 | 62.02 (stagnated) | 52.74 | 61.45 (degraded) |
| BroRL | 512 | 3,000+107 | +98.1 | 62.62 | 53.31 | 62.71 |
| BroRL | 512 | 3,000+134 | +122.8 | 62.85 | 53.48 | 62.82 |
| BroRL | 512 | 3,000+419 | +393.9 | 63.66 | 56.64 | 63.40 |
Table 2. Comprehensive performance comparison of BroRL and ProRL on key reasoning benchmarks

After just 98.1 hours, BroRL had already decisively surpassed the final performance of the ProRL run across all metrics, doing so in roughly 35 fewer hours. This confirms that scaling rollout size is a more effective and computationally efficient strategy for pushing the boundaries of a saturated model.

BroRL sets a new state of the art for 1.5B reasoning models, achieving the highest scores on the Math (63.66), Code (56.64), and Reasoning Gym (63.40) benchmarks.

Superior compute efficiency

BroRL isn’t just better; it’s also faster and smarter with its compute.

  • Algorithmic efficiency: Large-N rollouts produce a more diverse set of candidate samples. The pass rate for dynamic sampling, which filters out uninformative trajectories, jumped from 41% to 62%, meaning less computation was wasted (see the sketch after Table 3).
  • Hardware efficiency: BroRL shifts the generation process from being memory-bound to compute-bound and improves the prefix cache hit rate. Consequently, the GPU can fully utilize its parallel processing power, nearly doubling the throughput from 36.5 to 72.4 samples/s in our hardware setup.
| Method (N) | Dynamic sampling pass rate | Generation throughput (samples/s) |
| --- | --- | --- |
| ProRL (16) | 41% | 36.5 |
| BroRL (512) | 62% | 72.4 |
Table 3. Compute efficiency metrics for BroRL versus ProRL (sampling pass rate and throughput)
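
As a rough illustration of what dynamic sampling filters for (a sketch of the idea, not the exact criterion used in training), a prompt is only worth keeping if its rollout group contains both successes and failures, since an all-correct or all-wrong group produces zero group-relative advantage:

```python
def passes_dynamic_sampling(rewards):
    """Keep a prompt only if its rollouts disagree: an all-correct or all-wrong group
    yields zero group-relative advantage and therefore no learning signal."""
    return 0 < sum(rewards) < len(rewards)

# With N=512 rollouts per prompt, even hard prompts are likely to surface at least one
# success, so fewer groups are discarded (the 41% -> 62% pass rate jump in Table 3).
group_small = [0] * 16         # N=16: no correct rollout found, the prompt is wasted
group_large = [0] * 511 + [1]  # N=512: a single success keeps the prompt informative
print(passes_dynamic_sampling(group_small))  # False
print(passes_dynamic_sampling(group_large))  # True
```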

Greater token efficiency

BroRL delivers higher accuracy with fewer output tokens on both Math and Code benchmarks, indicating better score-per-token efficiency and tighter, less redundant reasoning.

Large-N rollout exploration (N=512) surfaces many concise, high-yield trajectories per prompt, which both raises the probability of sampling compact correct chains and reduces reliance on verbose, low-signal reasoning. This decouples quality from response length, whereas step scaling typically inflates token counts.

| Task | ProRL score | BroRL score | Score diff | ProRL tokens | BroRL tokens | Token diff |
| --- | --- | --- | --- | --- | --- | --- |
| Math | 62.02 | 63.66 | +1.64 | 16,506 | 15,760 | -745 |
| Code | 52.74 | 56.64 | +3.90 | 26,808 | 26,090 | -717 |
Table 4. Token efficiency comparison of BroRL and ProRL on math and code tasks
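
One way to read Table 4 is as score per 1,000 output tokens. The quick arithmetic below, using the table’s numbers directly, makes the efficiency gap explicit: roughly 3.76 versus 4.04 points per 1K tokens on Math and 1.97 versus 2.17 on Code.

```python
# Score per 1,000 output tokens, computed directly from Table 4.
results = {
    "Math": {"ProRL": (62.02, 16_506), "BroRL": (63.66, 15_760)},
    "Code": {"ProRL": (52.74, 26_808), "BroRL": (56.64, 26_090)},
}

for task, methods in results.items():
    for method, (score, tokens) in methods.items():
        print(f"{task} {method}: {1000 * score / tokens:.2f} points per 1K tokens")
```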

Get started with BroRL

Our findings establish rollout size not just as a hyperparameter, but as a critical and efficient axis for scaling reinforcement learning. The performance plateaus encountered by step-scaling methods are not fundamental limits of RL but artifacts of insufficient exploration. Key insights and takeaways include:

  • Rollout scaling is a new, crucial scaling dimension for RL. It provides a stable learning signal where depth scaling alone fails.
  • Performance plateaus are not dead ends. They can be overcome by scaling rollouts to generate higher-quality policy updates.
  • BroRL is more computationally efficient, doubling hardware throughput and improving algorithmic sample efficiency.
  • BroRL is more token efficient, achieving more with less.
  • The new BroRL-trained checkpoint sets a new state of the art for 1.5B reasoning models.

For those looking to maximize the potential of their models with RL, BroRL provides a principled path forward: when you hit a wall, don’t just push forward; go wider.

To get started, explore and evaluate the BroRL model, available through Hugging Face.
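
Below is a minimal sketch for loading and prompting the released checkpoint with the Hugging Face transformers library. The model ID shown is a placeholder; substitute the actual repository name from the BroRL release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ID -- replace with the actual BroRL 1.5B repository name on Hugging Face.
model_id = "nvidia/BroRL-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```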

Acknowledgments

Thanks to Yejin Choi, Fang Wu, Zaid Harchaoui, Pavlo Molchanov, Jan Kautz, and Jun Yang for their contributions to this post.


