Reinforcement Learning from Human Feedback, Explained Simply


The release of ChatGPT in 2022 completely changed how the world perceives artificial intelligence. Its impressive performance spurred the rapid development of other powerful LLMs.

We could roughly say that ChatGPT is an upgraded version of GPT-3. But compared to previous GPT versions, this time OpenAI's developers did not just use more data or more complex model architectures. Instead, they designed a remarkable training technique that enabled a breakthrough.

NLP development before ChatGPT

To set the context, let us recall how LLMs were developed before ChatGPT. Typically, LLM development consisted of two stages:

Pre-training & fine-tuning framework

Pre-training consists of language modeling: a task in which the model tries to predict a hidden token from its context. The probability distribution the model produces for the hidden token is then compared with the ground-truth distribution to compute the loss, which is backpropagated. In this way, the model learns the semantic structure of the language and the meaning behind words.
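As a rough sketch of this objective, the snippet below computes a language-modeling loss in PyTorch; the random hidden state, the tiny vocabulary, and the single linear head are purely illustrative stand-ins for a real transformer.

```python
# A minimal sketch of the language-modeling objective (assumes PyTorch).
import torch
import torch.nn.functional as F

vocab_size, hidden_dim = 100, 16

# Hidden state of the position whose token is being predicted
# (in a real LLM this would come from the transformer layers).
hidden_state = torch.randn(1, hidden_dim)

# The output projection maps the hidden state to a score per vocabulary token.
lm_head = torch.nn.Linear(hidden_dim, vocab_size)
logits = lm_head(hidden_state)              # shape: (1, vocab_size)

# Ground-truth token id for the hidden position (arbitrary here).
target = torch.tensor([42])

# Cross-entropy compares the predicted distribution with the ground truth;
# backpropagating this loss is what teaches the model language structure.
loss = F.cross_entropy(logits, target)
loss.backward()
```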

After that, the model is fine-tuned on a downstream task, which can have different objectives: text summarization, translation, text generation, question answering, etc. In many situations, fine-tuning requires a human-labeled dataset, which should ideally contain enough text samples to allow the model to generalize well and avoid overfitting.

This is where the limits of fine-tuning appear. Data annotation is a time-consuming task performed by humans. Take a question-answering task, for instance. To construct training samples, we would need a manually labeled dataset of questions and answers: for every question, a precise answer provided by a human. For example:

During data annotation, providing full answers to prompts requires a lot of human time.

In reality, training an LLM would require millions or even billions of such (question, answer) pairs. This annotation process is extremely time-consuming and does not scale well.

RLHF

Having understood the main problem, it is now the perfect moment to dive into the details of RLHF.

If you have already used ChatGPT, you have probably encountered a situation in which it asks you to choose the response that better suits your initial prompt:

This information is actually used to continuously improve ChatGPT. Let us understand how.

First of all, it is important to note that choosing the better answer between two options is a much simpler task for a human than writing a precise answer to an open question. The idea we are going to look at is based exactly on that: we want the human to merely select one of two candidate answers, and use those choices to create the annotated dataset.

Response generation

In LLMs, there are several ways to generate a response from the predicted distribution of token probabilities:

  • Having an output probability distribution over tokens, the model always deterministically chooses the token with the highest probability.
  • Having an output probability distribution over tokens, the model randomly samples a token according to its assigned probability.

The second, sampling-based method leads to more randomized model behavior, which allows the generation of diverse text sequences. For now, let us suppose that we generate many pairs of such sequences. The resulting dataset of pairs is labeled by humans: for each pair, a human is asked which of the two output sequences better matches the input prompt. The annotated dataset is used in the next step.
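The snippet below sketches both decoding strategies in PyTorch, using random logits as a stand-in for a real model's output.

```python
# A minimal sketch contrasting greedy decoding and sampling (assumes PyTorch).
import torch

logits = torch.randn(1, 100)               # scores over a 100-token toy vocabulary
probs = torch.softmax(logits, dim=-1)

# 1) Greedy: always take the most probable token -> deterministic output.
greedy_token = torch.argmax(probs, dim=-1)

# 2) Sampling: draw a token according to its probability -> diverse outputs,
#    which is what lets us generate two different responses for the same prompt.
sampled_token_a = torch.multinomial(probs, num_samples=1)
sampled_token_b = torch.multinomial(probs, num_samples=1)

print(greedy_token.item(), sampled_token_a.item(), sampled_token_b.item())
```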

Reward Model

After the annotated dataset is created, we use it to train a so-called "reward" model, whose goal is to learn to numerically estimate how good or bad a given answer is for an initial prompt. Ideally, we want the reward model to output positive values for good responses and negative values for bad ones.
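As an illustration, here is a minimal sketch of what such a reward model could look like in PyTorch; the RewardModel class with its toy embedding backbone is a hypothetical stand-in for a pretrained transformer topped with a scalar head.

```python
# A minimal sketch of a reward model (assumes PyTorch).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size=100, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)  # toy backbone
        self.value_head = nn.Linear(hidden_dim, 1)              # scalar reward

    def forward(self, token_ids):
        hidden = self.embedding(token_ids).mean(dim=1)   # pool over the sequence
        return self.value_head(hidden).squeeze(-1)       # one number per (prompt, answer)

reward_model = RewardModel()
prompt_and_answer = torch.randint(0, 100, (1, 12))       # token ids of prompt + answer
print(reward_model(prompt_and_answer))                   # e.g. tensor([-0.13])
```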

Loss function

You might reasonably ask how the reward model can learn this regression task if there are no numerical labels in the annotated dataset. To address this, we are going to use an interesting trick: we will pass both the better and the worse answer through the reward model, which will output two different estimates (rewards).

Then we will cleverly construct a loss function that compares the two rewards relative to each other.

Loss = -log(σ(R₊ - R₋))

Loss function used in the RLHF algorithm, where σ is the sigmoid function. R₊ refers to the reward assigned to the better response, while R₋ is the reward estimated for the worse response.

Let us plug some values into the loss function and analyze its behavior. Below is a table with the plugged-in values:

R₊ - R₋:   -3      -1      0       1       3
Loss:      3.05    1.31    0.69    0.31    0.05

A table of loss values depending on the difference between R₊ and R₋.

We can immediately make two interesting observations:

  • If R₊ < R₋, i.e. the better response received a lower reward than the worse one, then the loss value grows proportionally to the reward difference, meaning that the model needs to be significantly adjusted.
  • If R₊ > R₋, i.e. the better response received a higher reward than the worse one, then the loss is bounded within much lower values in the interval (0, 0.69), which indicates that the model distinguishes good and bad responses well.
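Here is a small sketch that computes this pairwise loss and reproduces the kind of values shown in the table above.

```python
# A minimal sketch of the pairwise loss  L = -log(sigmoid(R_plus - R_minus)).
import math

def pairwise_loss(r_plus: float, r_minus: float) -> float:
    """Reward-model loss for a (better, worse) response pair."""
    diff = r_plus - r_minus
    return -math.log(1.0 / (1.0 + math.exp(-diff)))  # -log(sigmoid(diff))

for diff in (-3, -1, 0, 1, 3):
    print(f"R+ - R- = {diff:+d} -> loss = {pairwise_loss(float(diff), 0.0):.2f}")

# Large losses when the worse answer is rated higher (diff < 0);
# losses bounded by log(2) ≈ 0.69 when the better answer wins (diff >= 0).
```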

Training the original LLM

The trained reward model is then used to train the original LLM. To do this, we feed a series of new prompts to the LLM, which generates output sequences. Then the input prompts, together with the output sequences, are fed to the reward model to estimate how good those responses are.

These numerical estimates are then used as a feedback signal for the original LLM, which performs weight updates accordingly. A very simple but elegant approach!
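Below is a heavily simplified sketch of this feedback step. Real RLHF implementations rely on PPO with a KL penalty that keeps the updated model close to the original one; the REINFORCE-style update and toy tensors here only illustrate how the reward signal can reach the LLM's weights.

```python
# A simplified policy-gradient sketch of the feedback step (assumes PyTorch).
import torch
import torch.nn as nn

vocab_size, hidden_dim = 100, 16
llm_head = nn.Linear(hidden_dim, vocab_size)          # stand-in for the LLM
optimizer = torch.optim.Adam(llm_head.parameters(), lr=1e-4)

prompt_state = torch.randn(1, hidden_dim)             # encoded prompt (toy)
logits = llm_head(prompt_state)
probs = torch.softmax(logits, dim=-1)

# The LLM samples a response token; the reward model scores the full response.
token = torch.multinomial(probs, num_samples=1)
reward = torch.tensor(0.7)                            # pretend reward-model output

# Policy gradient: push up the log-probability of tokens that earned high reward.
log_prob = torch.log(probs.gather(1, token))
loss = -(reward * log_prob).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```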

RLHF training diagram

Inference

During inference, only the original trained model is used. At the same time, the model can be continuously improved in the background by collecting user prompts and periodically asking users to rate which of two responses is better.

Conclusion

In this article, we have studied RLHF, a highly efficient and scalable technique for training modern LLMs. An elegant combination of an LLM with a reward model significantly simplifies the annotation task performed by humans, which previously required enormous effort when done through raw fine-tuning procedures.

