Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU




We’re excited to officially release the integration of trl with peft to make Large Language Model (LLM) fine-tuning with Reinforcement Learning more accessible to everyone! In this post, we explain why this is a competitive alternative to existing fine-tuning approaches.

Note that peft is a general tool that can be applied to many ML use cases, but it’s particularly interesting for RLHF as this method is especially memory-hungry!

If you want to dive straight into the code, check out the example scripts on the documentation page of TRL.



Introduction



LLMs & RLHF

LLMs combined with RLHF (Reinforcement Learning with Human Feedback) seem to be the next go-to approach for building very powerful AI systems such as ChatGPT.

Training a language model with RLHF typically involves the following three steps:

1- Fine-tune a pretrained LLM on a specific domain or corpus of instructions and human demonstrations

2- Collect a human annotated dataset and train a reward model

3- Further fine-tune the LLM from step 1 with the reward model and this dataset using RL (e.g. PPO)

The choice of the base LLM is quite crucial here. At the time of writing, the “best” open-source LLMs that can be used “out-of-the-box” for many tasks are instruction-finetuned LLMs. Notable models include BLOOMZ, Flan-T5, Flan-UL2, and OPT-IML. The downside of these models is their size. To get a decent model, you need at least to play with 10B+ scale models, which would require up to 40GB of GPU memory in full precision, just to fit the model on a single GPU device without doing any training at all!



What’s TRL?

The trl library aims to make the RL step much easier and more flexible so that anyone can fine-tune their LM using RL on their custom dataset and training setup. Among many other applications, you can use this algorithm to fine-tune a model to generate positive movie reviews, do controlled generation, or make the model less toxic.

Using trl you can run one of the most popular Deep RL algorithms, PPO, in a distributed manner or on a single device! We leverage accelerate from the Hugging Face ecosystem to make this possible, so that any user can scale up their experiments to an interesting scale.

Fine-tuning a language model with RL follows roughly the protocol detailed below. This requires having 2 copies of the original model; to avoid the active model deviating too much from its original behavior/distribution, you need to compute the logits of the reference model at each optimization step. This adds a hard constraint on the optimization process, as you always need at least two copies of the model per GPU device. If the model grows in size, it becomes more and more tricky to fit the setup on a single GPU.

trl_diagram
Overview of the PPO training setup in TRL.

In trl you can also use shared layers between the reference and active models to avoid keeping entire copies. A concrete example of this feature is showcased in the detoxification example.
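To make the protocol above concrete, here is a minimal sketch of a single PPO step with trl. The checkpoint name, generation length, and constant reward are illustrative placeholders, and the class and method names follow the trl API at the time of writing, so check your installed version.

```python
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

model_name = "lvwerra/gpt2-imdb"  # placeholder checkpoint for illustration

# Active model (with a value head) and a frozen reference copy.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

# One PPO optimization step: generate a response, score it, update the policy.
query = tokenizer("This movie was really", return_tensors="pt").input_ids[0]
output = ppo_trainer.generate(query, max_new_tokens=16)
response = output[0][query.shape[0]:]  # strip the prompt tokens from the generation
reward = [torch.tensor(1.0)]           # in practice this comes from a reward model
stats = ppo_trainer.step([query], [response], reward)
```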



Training at scale

Training at scale can be challenging. The first challenge is fitting the model and its optimizer states on the available GPU devices. The amount of GPU memory a single parameter takes depends on its “precision” (or more specifically dtype). The most common dtypes are float32 (32-bit), float16, and bfloat16 (16-bit). More recently, “exotic” precisions such as int8 (8-bit) are supported out of the box for training and inference (with certain conditions and constraints). In a nutshell, to load a model on a GPU device, each billion parameters costs 4GB in float32 precision, 2GB in float16, and 1GB in int8. If you would like to learn more about this topic, have a look at this blogpost which dives deeper: https://huggingface.co/blog/hf-bitsandbytes-integration.

If you use an AdamW optimizer, each parameter needs an additional 8 bytes for the optimizer states (e.g. if your model has 1B parameters, the full AdamW optimizer of the model would require 8GB of GPU memory – source).
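As a rough back-of-the-envelope check (under the simplifying assumption that weights and optimizer states dominate, ignoring activations and gradients), a few lines of Python reproduce these numbers for a 20B parameter model:

```python
# Rough memory estimate for holding model weights and AdamW states (illustrative only).
n_params = 20e9  # a 20B parameter model

for dtype, bytes_per_param in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
    print(f"{dtype:>18}: ~{n_params * bytes_per_param / 1e9:.0f} GB of weights")

# AdamW keeps two float32 states (momentum and variance) per parameter, ~8 bytes in total.
print(f"AdamW optimizer states: ~{n_params * 8 / 1e9:.0f} GB")
```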

Many techniques have been adopted to tackle these challenges at scale. The most familiar paradigms are Pipeline Parallelism, Tensor Parallelism, and Data Parallelism.

With Data Parallelism the same model is hosted in parallel on several machines and each instance is fed a different data batch. This is the most straightforward parallelism strategy, essentially replicating the single-GPU case, and is already supported by trl. With Pipeline and Tensor Parallelism the model itself is distributed across machines: in Pipeline Parallelism the model is split layer-wise, whereas Tensor Parallelism splits tensor operations across GPUs (e.g. matrix multiplications). With these Model Parallelism strategies, you need to shard the model weights across many devices, which requires you to define a communication protocol for the activations and gradients across processes. This is not trivial to implement and might require the adoption of frameworks such as Megatron-DeepSpeed or Nemo. It is also important to highlight other tools that are essential for scaling LLM training, such as adaptive activation checkpointing and fused kernels. Further reading about parallelism paradigms can be found here.

Therefore, we asked ourselves the following question: how far can we go with just data parallelism? Can we use existing tools to fit super-large training processes (including active model, reference model, and optimizer states) on a single device? The answer appears to be yes. The main ingredients are: adapters and 8-bit matrix multiplication! Let us cover these topics in the following sections:



8-bit matrix multiplication

Efficient 8-bit matrix multiplication is a method that was first introduced in the paper LLM.int8() and aims to solve the performance degradation issue when quantizing large-scale models. The proposed method breaks down the matrix multiplications that are applied under the hood in Linear layers into two stages: the outlier hidden-state part that is performed in float16, and the “non-outlier” part that is performed in int8.

8bit-matmul
The LLM.int8() scheme: outlier hidden-state columns are multiplied in float16 while the remaining values are multiplied in int8.

In a nutshell, you can reduce the size of a full-precision model by a factor of 4 (and a half-precision model by a factor of 2) if you use 8-bit matrix multiplication.



Low rank adaptation and PEFT

In 2021, a paper called LoRA: Low-Rank Adaptation of Large Language Models demonstrated that fine-tuning of large language models can be performed by freezing the pretrained weights and creating low-rank versions of the query and value attention matrices. These low-rank matrices have far fewer parameters than the original model, enabling fine-tuning with far less GPU memory. The authors show that fine-tuning of low-rank adapters achieves results comparable to fine-tuning the full pretrained model.

lora-gif
The output activations of the original (frozen) pretrained weights (left) are augmented by a low-rank adapter composed of weight matrices A and B (right).

This technique allows fine-tuning of LLMs using a fraction of the memory requirements. There are, however, some downsides: the forward and backward passes are roughly twice as slow, due to the extra matrix multiplications in the adapter layers.
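The idea can be summarized in a few lines of plain PyTorch. The dimensions, rank, and scaling below are purely illustrative, not the values used in the paper:

```python
import torch

d, k, r, alpha = 1024, 1024, 8, 16      # layer dims, adapter rank, scaling
W = torch.randn(d, k)                    # frozen pretrained weight
A = torch.randn(r, k) * 0.01             # trainable low-rank factor
B = torch.zeros(d, r)                    # trainable, zero-initialized so training starts from W

x = torch.randn(4, k)                    # a batch of hidden states
h = x @ W.T + (alpha / r) * (x @ A.T @ B.T)   # original output + low-rank update

print("frozen params:   ", W.numel())               # 1,048,576
print("trainable params:", A.numel() + B.numel())   # 16,384 (~1.6% of the layer)
```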



What’s PEFT?

Parameter-Efficient Fine-Tuning (PEFT) is a Hugging Face library created to support the creation and fine-tuning of adapter layers on LLMs. peft is seamlessly integrated with 🤗 Accelerate for large-scale models, leveraging DeepSpeed and Big Model Inference.

The library supports many state-of-the-art models and has an extensive set of examples, including:

  • Causal language modeling
  • Conditional generation
  • Image classification
  • 8-bit int8 training
  • Low-rank adaptation of Dreambooth models
  • Semantic segmentation
  • Sequence classification
  • Token classification

The library is still under extensive and active development, with many upcoming features to be announced in the coming months.



Fine-tuning 20B parameter models with Low Rank Adapters

Now that the prerequisites are out of the way, let us go through the entire pipeline step by step, and explain with figures how you can fine-tune a 20B parameter LLM with RL using the tools mentioned above on a single 24GB GPU!



Step 1: Load your active model in 8-bit precision

step1
Loading a model in 8-bit precision can save up to 4x memory compared to the full-precision model

A “free-lunch” memory reduction of an LLM using transformers is to load your model in 8-bit precision using the method described in LLM.int8. This can be done by simply adding the flag load_in_8bit=True when calling the from_pretrained method (you can read more about that here).

As stated in the previous section, a “hack” to compute the amount of GPU memory you need to load your model is to think in terms of “billions of parameters”. As one byte holds 8 bits, you need 4GB per billion parameters for a full-precision model (32 bits = 4 bytes), 2GB per billion parameters for a half-precision model, and 1GB per billion parameters for an int8 model — roughly 80GB, 40GB, and 20GB respectively for a 20B parameter model.

So to begin with, let’s just load the active model in 8-bit, as in the sketch below. Then let’s see what we need to do for the second step!
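A minimal sketch of this step, assuming the 20B checkpoint EleutherAI/gpt-neox-20b and a working bitsandbytes installation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"

# Load the weights in int8 via LLM.int8(); device_map="auto" dispatches layers onto the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```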


Step 2: Add extra trainable adapters using peft

step2
You can easily add adapters on a frozen 8-bit model, thus reducing the memory requirements of the optimizer states by training only a small fraction of the parameters

The second step is to load adapters inside the model and make these adapters trainable. This enables a drastic reduction of the number of trainable weights that are needed for the active model. This step leverages the peft library and can be performed with a few lines of code, sketched below. Note that once the adapters are trained, you can easily push them to the Hub to use them later.
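A minimal sketch with peft, continuing from the 8-bit model loaded in step 1; the rank, alpha, and dropout values are illustrative and not tuned:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the frozen 8-bit model with trainable LoRA adapters.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the 20B weights is trainable
```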



Step 3: Use the same model to get the reference and active logits

step3
You can easily disable and enable adapters using the peft API.

Since adapters can be deactivated, we can use the same model to get the reference and active logits for PPO, without having to create two copies of the same model! This leverages a feature of the peft library: the disable_adapter context manager, illustrated below.
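In practice this looks roughly like the following sketch, where model and tokenizer are assumed to be the peft-wrapped model and tokenizer from the previous steps (names follow the peft API at the time of writing):

```python
# Any batch of token ids works here; this prompt is just a placeholder.
input_ids = tokenizer("The movie was", return_tensors="pt").input_ids

# Adapters enabled: the active (trainable) policy produces these logits.
active_logits = model(input_ids).logits

# Adapters disabled: the same weights behave like the original frozen model,
# giving us the reference logits without a second copy in memory.
with model.disable_adapter():
    ref_logits = model(input_ids).logits
```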



Overview of the training scripts:

We will now describe how we trained a 20B parameter gpt-neox model using transformers, peft, and trl. The end goal of this example was to fine-tune an LLM to generate positive movie reviews in a memory-constrained setting. Similar steps could be applied to other tasks, such as dialogue models.

Overall there were three key steps and training scripts:

  1. Script – Fine-tuning a Low Rank Adapter on a frozen 8-bit model for text generation on the imdb dataset.
  2. Script – Merging the adapter layers into the base model’s weights and storing these on the Hub.
  3. Script – Sentiment fine-tuning of a Low Rank Adapter to create positive reviews.

We tested these steps on a 24GB NVIDIA 4090 GPU. While it is possible to perform the entire training run on a 24GB GPU, the full training runs were undertaken on a single A100 on the 🤗 research cluster.

The first step in the training process was fine-tuning the pretrained model. Typically this would require several high-end 80GB A100 GPUs, so we chose to train a low-rank adapter instead. We treated this as a Causal Language Modeling setting and trained for one epoch on examples from the imdb dataset, which features movie reviews and labels indicating whether they are of positive or negative sentiment.

loss-20b
Training loss during one epoch of training of a gpt-neox-20b model on the imdb dataset

In order to take the adapted model and perform further fine-tuning with RL, we first needed to combine the adapted weights with the base model. This was achieved by loading the pretrained model and adapter in 16-bit floating point and summing the weight matrices (with the appropriate scaling applied), as in the sketch below.
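A sketch of this merging step using peft’s merge_and_unload helper; the adapter path and output names are placeholders, and the actual training script may differ in its details:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and the trained adapter in float16, then fold the adapter
# weights (with their scaling) back into the base weight matrices.
base = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "path-to-trained-adapter")  # placeholder path
merged = model.merge_and_unload()

merged.save_pretrained("gpt-neox-20b-imdb-merged")  # placeholder output name
```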

Finally, we could then fine-tune another low-rank adapter on top of the frozen imdb-finetuned model. We use an imdb sentiment classifier to provide the rewards for the RL algorithm, as sketched below.
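A sketch of how a sentiment classifier’s output can be turned into scalar rewards for PPO. The checkpoint name is a placeholder (any imdb sentiment model exposing a POSITIVE label works), and the resulting rewards would be fed to ppo_trainer.step as in the trl sketch earlier:

```python
import torch
from transformers import pipeline

# Placeholder checkpoint: any imdb sentiment classifier with a POSITIVE label works here.
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

def compute_rewards(query_texts, response_texts):
    """Score each query+response pair and use the POSITIVE probability as the reward."""
    texts = [q + r for q, r in zip(query_texts, response_texts)]
    outputs = sentiment_pipe(texts, top_k=None)  # return scores for every label
    return [
        torch.tensor(next(s["score"] for s in out if s["label"] == "POSITIVE"))
        for out in outputs
    ]
```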

reward-20b
Mean reward when RL fine-tuning a peft-adapted 20B parameter model to generate positive movie reviews.

The full Weights and Biases report for this experiment is available here, if you want to check out more plots and text generations.



Conclusion

We have implemented a new functionality in trl that allows users to fine-tune large language models using RLHF at a reasonable cost by leveraging the peft and bitsandbytes libraries. We demonstrated that fine-tuning gpt-neox-20b (40GB in bfloat16!) on a 24GB consumer GPU is possible, and we expect this integration to be widely used by the community to fine-tune larger models using RLHF and share great artifacts.

We have identified some interesting directions for the next steps to push the boundaries of this integration:

  • How will this scale in the multi-GPU setting? We will mainly explore how this integration scales with respect to the number of GPUs, whether it is possible to apply Data Parallelism out of the box, or if it will require the adoption of new features in any of the involved libraries.
  • What tools can we leverage to increase training speed? We have observed that the main downside of this integration is the overall training speed. In the future we would be keen to explore possible directions to make training much faster.



