ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step


A less expensive alignment method performing as well as DPO


There are many methods to align large language models (LLMs) with human preferences. Reinforcement learning with human feedback (RLHF) was one of the first and brought us ChatGPT, but RLHF is very costly. DPO, IPO, and KTO are notably cheaper than RLHF as they don't need a reward model.

While DPO and IPO are cheaper, they still require training two different models: one model for the supervised fine-tuning (SFT) step, i.e., training the model to answer instructions, and then a second model aligned with human preferences, using the SFT model for initialization and as a reference.

ORPO is yet another new method for LLM alignment, but this one doesn't even need the SFT model. With ORPO, the LLM jointly learns to answer instructions and to align with human preferences.
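To make this joint learning more concrete, here is a minimal PyTorch sketch of the ORPO objective as I read it from the paper: a standard negative log-likelihood on the chosen response plus an odds-ratio term that pushes the odds of the chosen response above the odds of the rejected one. The function name `orpo_loss` and the `lam` weight (the paper's λ) are illustrative, not taken from any particular library, and the inputs are assumed to be length-normalized log-probabilities from a forward pass of the model.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """Illustrative sketch of the ORPO objective.

    chosen_logps / rejected_logps: length-normalized (average per-token)
    log-probabilities of the chosen and rejected responses, shape (batch,).
    lam: weight of the odds-ratio term relative to the SFT loss.
    """
    # SFT part: negative log-likelihood of the chosen response
    sft_loss = -chosen_logps.mean()

    # Odds of a response: p / (1 - p), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio part: reward a higher odds for the chosen response
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

    return sft_loss + lam * or_loss
```

Because the same chosen log-probabilities feed both terms, a single forward pass per preference pair is enough: there is no separate SFT model and no frozen reference model to keep in memory.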

In this article, I explain ORPO and review its performance. I show how to use it to turn Mistral 7B into a chat model using consumer hardware.

ORPO is presented in this paper:

ORPO: Monolithic Preference Optimization without Reference Model
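To give an idea of what the fine-tuning setup can look like in practice, here is a minimal sketch using Hugging Face TRL's ORPOTrainer with 4-bit quantization and a LoRA adapter, the kind of configuration that fits Mistral 7B on a single consumer GPU. The dataset, LoRA targets, and hyperparameters below are illustrative assumptions, not the exact recipe used later in the article, and some argument names may differ depending on your TRL version.

```python
# A minimal sketch, assuming TRL (with ORPOTrainer/ORPOConfig), peft,
# and bitsandbytes are installed.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"

# 4-bit quantization so the 7B model fits on a consumer GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapter: only a small fraction of the weights is trained
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# A preference dataset with prompt/chosen/rejected fields (it may need
# reformatting into plain text depending on the TRL version)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

orpo_args = ORPOConfig(
    output_dir="./mistral-7b-orpo",
    beta=0.1,                    # weight of the odds-ratio term
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset,
    tokenizer=tokenizer,         # named "processing_class" in recent TRL versions
    peft_config=peft_config,
)
trainer.train()
```

Note that, unlike a DPO setup, there is no `ref_model` argument and no prior SFT run: the trainer optimizes instruction following and preference alignment in the same pass.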
