PipelineRL


We’re excited to open-source PipelineRL, an experimental RL implementation that tackles a fundamental challenge in large-scale Reinforcement Learning with LLMs: the trade-off between inference throughput and on-policy data collection. PipelineRL’s key innovation is inflight weight updates during RL training (see Figure 1 below). These allow PipelineRL to sustain consistently high inference throughput while minimizing the lag between the weights used for rollouts and the most recently updated model weights. The result: fast and stable RL training for large language models.

[Figure 1: conventional RL (a) vs. PipelineRL with inflight weight updates (b)]

In this blog post, we show that 1) inflight weight updates do not harm the training process and 2) PipelineRL achieves results competitive with Open-Reasoner-Zero while using a simpler RL algorithm. We also present the modular PipelineRL architecture, which makes it easy to try new inference / trainer combinations.



Conventional RL vs PipelineRL

In conventional RL approaches (Figure 1a), there is a trade-off between high-throughput inference and on-policy data collection. To explain this trade-off, let us first define conventional RL algorithmically:

current_policy = initial_policy
opt_state = init_optimizer(current_policy)

while True:
    # Snapshot the current policy and generate rollouts for num_grad_steps
    # optimization steps at once (large batches keep inference throughput high).
    inference_policy = current_policy
    list_of_prompts = [sample_prompts(training_batch_size)
        for _ in range(num_grad_steps)]
    list_of_rollouts = [sample_rollouts(prompts, inference_policy)
        for prompts in list_of_prompts]

    # Consume the rollouts: with every optimizer step the remaining data
    # becomes one step more off-policy with respect to current_policy.
    lag = 0
    for rollouts in list_of_rollouts:
        current_policy, opt_state = policy_update(current_policy, opt_state, rollouts)
        lag += 1

To achieve high throughput, the inference servers must use large batch sizes and therefore generate data for multiple policy optimization steps at once. However, each optimization step increases the lag between the current policy and the inference policy that collected the data, progressively making the collected data more off-policy and less effective for training. For example, with num_grad_steps = 8, the rollouts consumed by the last optimization step come from a policy that is already 7 updates stale. On-policy learning requires generating data for only a single optimization step at a time, but producing small amounts of data with many GPUs is inefficient because the per-GPU batch size is small. Moreover, the batch size shrinks further as the inference server finishes the short sequences and only the few longest sequences remain in progress.

PipelineRL (Figure 1b) removes this trade-off through inflight weight updates: we update the weights on the inference servers after each optimizer step without ever stopping inference. Inference is paused at all inference servers for just the time needed to receive the new weights. Inflight weight updates allow the inference servers to continuously maintain the optimal batch size while keeping the data on-policy or close to on-policy, which leads to higher GPU utilization and more effective learning, respectively.
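For contrast with the conventional loop above, here is a pseudocode sketch of the PipelineRL loop; the helper names are illustrative, not the actual API. Generation never stops, the trainer consumes rollouts as they stream in, and new weights are broadcast after every optimizer step.

current_policy = initial_policy
opt_state = init_optimizer(current_policy)
start_inference_servers(current_policy)   # servers generate continuously

while True:
    # Wait only until enough freshly generated tokens for one optimizer step have streamed in.
    rollouts = rollout_queue.get(training_batch_size)
    current_policy, opt_state = policy_update(current_policy, opt_state, rollouts)
    # Inflight weight update: servers pause just long enough to receive the
    # new weights via broadcast, then resume with their batches still full.
    broadcast_weights(current_policy)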



PipelineRL works!

[Figure 2: training curves of PipelineRL and Open-Reasoner-Zero on AIME 2024 and MATH 500 for the 7B and 32B models]

To demonstrate the effectiveness of PipelineRL and the benefits of inflight weight updates, we trained a 7B model and a 32B model on the Open-Reasoner-Zero dataset. Looking at the training curves, we see that PipelineRL matches or exceeds the performance of Open-Reasoner on the popular reasoning benchmarks AIME 2024 and MATH 500 (see Figure 2 above).

Notably, our RL implementation is much simpler than Open-Reasoner-Zero’s. While Open-Reasoner-Zero uses a value function, our implementation is a simplified version of GRPO. Specifically, we found that trust-region importance weight clamping is not needed for stable training, and neither were the overlong sequence filtering or the reward shaping from the DAPO paper. To normalize the loss we simply use the number of sequences in the batch as the denominator, giving equal weight to all tokens. We used no KL penalty and no entropy bonus (though our implementation does support a reference-model KL penalty). Despite the simplicity of our implementation, or perhaps because of it, training is very stable, as you can see in this wandb report.
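To make the sequence-count normalization concrete, here is a minimal sketch of such a simplified GRPO-style objective; the tensor names and shapes are assumptions for illustration, not the exact code in the repo.

import torch

def simple_grpo_loss(token_logprobs, token_mask, rewards, group_ids):
    # token_logprobs: (B, T) log-probs of the sampled tokens under the current policy
    # token_mask:     (B, T) 1.0 for generated tokens, 0.0 for prompt / padding
    # rewards:        (B,)   scalar reward of each rollout
    # group_ids:      (B,)   index of the prompt each rollout was sampled from
    advantages = torch.zeros_like(rewards)
    for g in group_ids.unique():
        idx = group_ids == g
        # group-relative advantage: reward minus the mean reward of that prompt's rollouts
        advantages[idx] = rewards[idx] - rewards[idx].mean()
    # every token inherits the advantage of its rollout (REINFORCE-style objective)
    per_token_loss = -advantages[:, None] * token_logprobs * token_mask
    # divide by the number of sequences, so every token carries the same weight
    return per_token_loss.sum() / rewards.shape[0]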

One might expect that inflight weight updates would result in an unstable training process, since sequence generation proceeds with stale keys and values in the KV cache that were computed with a previous model version. However, our experiments indicate this does not adversely affect stability.



PipelineRL architecture

[Figure: the PipelineRL architecture]

PipelineRL is built to be modular and to take advantage of rapid improvements in highly specialized inference and training software (SGLang, vLLM, NVIDIA Dynamo, DeepSpeed, FSDP, TorchTitan, FastLLM, etc.). We propose clear contracts between the inference and training components, allowing easy integration of new inference and training solutions as they become available.



Inference contract

The inference software must expose the following APIs to PipelineRL[1]:

  1. Process group initialization: At start-up time, Trainer 0 (the designated coordinator) sends an HTTP POST /init_process_group request to all inference servers. This request initializes the process group that will be used for sending the weight updates.
  2. Weight update trigger: Once the trainers complete a learning step (optimizer step and weight gathering), Trainer 0 submits an HTTP POST /request_weight_update request to the inference endpoint. The request contains the details on the order and shapes of the weights that the main trainer process is about to transfer via NCCL. The inference servers must pause inference and receive the weight broadcast.
  3. Chat completion: The actor process interacts with the actor LLMs using HTTP POST /v1/chat/completions requests.

If the init_process_group and request_weight_update APIs become an industry standard, one will be able to plug and play different inference implementations with PipelineRL.
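For illustration, a client-side sketch of these three endpoints could look like the following; the payload fields are assumptions, not the exact schema used by PipelineRL.

import requests

INFERENCE_URLS = ["http://inference-0:8000", "http://inference-1:8000"]

# 1. Trainer 0 initializes the process group used for weight broadcasts.
for url in INFERENCE_URLS:
    requests.post(f"{url}/init_process_group", json={
        "master_address": "trainer-0",           # assumed field
        "master_port": 29500,                    # assumed field
        "world_size": 1 + len(INFERENCE_URLS),   # assumed field
    })

# 2. After each optimizer step, Trainer 0 announces the order, shapes, and dtypes of the
#    weights it is about to broadcast; servers pause generation only for the transfer.
for url in INFERENCE_URLS:
    requests.post(f"{url}/request_weight_update", json={
        "parameters": [
            {"name": "model.embed_tokens.weight", "shape": [152064, 3584], "dtype": "bfloat16"},
            # ... one entry per tensor, in broadcast order (assumed schema)
        ],
    })

# 3. The actor process samples rollouts through the usual chat completion API.
response = requests.post(f"{INFERENCE_URLS[0]}/v1/chat/completions", json={
    "model": "policy",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
})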



Trainer contract

PipelineRL training code feeds freshly generated training data to the trainer workers as soon as the right number of training tokens has accrued for each of them. Any training software that exposes the following Python APIs can be made to work with PipelineRL:

  • Worker initialization: Load and shard the training weights and the optimizer state.
  • Forward pass: Produce token log-likelihoods given inputs.
  • Backward step: Compute and accumulate the gradient of the scalar that represents the chosen RL objective.
  • Optimizer step: Execute the optimizer step.
  • Weight gathering and broadcasting: After an optimizer step, the trainer software must gather the updated model weights layer by layer in preparation for broadcasting them to the inference servers.

PipelineRL currently uses the Hugging Face Accelerate library to give the user a choice between DeepSpeed and FSDP, but we found that the Accelerate contract is too flexible and can be confusing. We will be moving to the stricter contract described above, which will make it easier to plug in other trainers.
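As a rough illustration of what that stricter contract could look like, here is a minimal Python interface sketch; the method names and signatures are assumptions, not the final API.

from abc import ABC, abstractmethod
import torch

class TrainerWorker(ABC):
    @abstractmethod
    def init_worker(self, model_path: str) -> None:
        """Load and shard the model weights and the optimizer state."""

    @abstractmethod
    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """Return per-token log-likelihoods for the given inputs."""

    @abstractmethod
    def backward(self, loss: torch.Tensor) -> None:
        """Compute and accumulate the gradient of the scalar RL objective."""

    @abstractmethod
    def optimizer_step(self) -> None:
        """Apply the accumulated gradients and advance the optimizer state."""

    @abstractmethod
    def gather_and_broadcast_weights(self) -> None:
        """Gather the updated weights layer by layer and broadcast them to the inference servers."""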



What’s next for PipelineRL?

Upcoming features. Our implementation is still experimental and lacks some important functionality. Top priorities for us include using coroutines for more precise inference batch size control, multi-modal support, and sequence-parallel training. We would also welcome contributions of more inference server and trainer integrations. We will not, however, try to make the pipeline-rl repo a framework that supports all possible algorithms and reward functions. Our take is that pipeline-rl should be a hackable and fast reference implementation of GRPO with easily verifiable rewards. If you’d like to do a research project using PipelineRL, you can just fork the repo and have fun hacking on the code!

More research coming soon. More analysis is needed to understand how inflight weight updates affect the training dynamics and to rigorously measure the speed-ups that PipelineRL brings. Much can also be said about the similarities between PipelineRL and highly relevant prior work on asynchronous Reinforcement Learning for LLMs. For all this and more, please stay tuned for our upcoming research paper!



Contributors and Acknowledgement

Alexandre Piché wrote the first synchronous version of our RL code while working on TapeAgents. Dzmitry Bahdanau refactored the code to be asynchronous and distributed, and implemented inflight weight updates. Rafael Pardinas implemented sequence packing. Ehsan Kamaloo helped with running the experiments. Xiaoyin Chen helped with debugging the framework.

We acknowledge prior RL-for-LLM implementations such as TRL, OpenRLHF, and veRL for the many tricks we borrowed from them. Artifacts from other open-source reasoning projects, such as SimpleRL, DeepScaleR, DAPO, and OpenReasoner, were instrumental in stabilizing PipelineRL. We would like to acknowledge Christopher Manning and Michael Noukhovitch for their thoughtful comments. Finally, we thank the broader ServiceNow Research team and the ServiceNow CoreLLM teams for being amazing colleagues.

[1] The current contract in the code is slightly different, but we are refactoring it as described above.



Experimental Details

We used the same hyperparameters for both the 7B and 32B experiments reported here:

  • batch size: 4096
  • learning rate: 1e-6
  • max number of generated tokens: 8192
    • note that the OpenReasoner runs allowed generation of up to 16K tokens

The compute we used for the reported experiments:

  • ~3.5 days on 2 nodes for the 7B model
  • ~6 days on 4 nodes for the 32B model


