
ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step

A less expensive alignment method performing as well as DPO


There are many methods to align large language models (LLMs) with human preferences. Reinforcement learning from human feedback (RLHF) was one of the first and brought us ChatGPT, but RLHF is very costly. DPO, IPO, and KTO are notably cheaper than RLHF as they don’t need a reward model.

While DPO and IPO are cheaper, they still require training two different models: one model for the supervised fine-tuning (SFT) step, i.e., training the model to answer instructions, and then a second model aligned with human preferences, using the SFT model for initialization and as a reference. A sketch of this two-stage pipeline is shown below.

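For context, here is roughly what the conventional two-stage pipeline looks like with Hugging Face TRL: an SFT run followed by a DPO run that starts from, and is regularized against, the SFT checkpoint. This is a minimal sketch, not the setup used later in the article; the dataset names and hyperparameters are placeholders, and argument names can differ slightly between TRL versions.

```python
# Conventional two-stage alignment: SFT first, then DPO initialized from
# (and referenced against) the SFT model.
# Minimal sketch with Hugging Face TRL; datasets and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Step 1: supervised fine-tuning on instruction data
sft_trainer = SFTTrainer(
    model=model_name,
    args=SFTConfig(output_dir="./sft_model", dataset_text_field="text", max_seq_length=1024),
    train_dataset=load_dataset("timdettmers/openassistant-guanaco", split="train"),
    tokenizer=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("./sft_model")

# Step 2: DPO on preference pairs, using the SFT checkpoint both as the
# starting point and as the frozen reference model
dpo_trainer = DPOTrainer(
    model=AutoModelForCausalLM.from_pretrained("./sft_model"),
    ref_model=AutoModelForCausalLM.from_pretrained("./sft_model"),
    args=DPOConfig(output_dir="./dpo_model", beta=0.1),
    train_dataset=load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs"),
    tokenizer=tokenizer,
)
dpo_trainer.train()
```
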
ORPO is yet another method for LLM alignment, but this one doesn’t even need the SFT model. With ORPO, the LLM jointly learns to answer instructions and to align with human preferences.

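Concretely, the ORPO objective (as formulated in the ORPO paper, with λ a weighting hyperparameter) adds an odds-ratio penalty to the standard SFT loss computed on the chosen answer y_w, pushing its odds above those of the rejected answer y_l:

$$
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]
$$

$$
\mathcal{L}_{OR} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$

Since both odds are computed with the policy being trained, no frozen reference model has to be kept in memory, which is what makes single-stage training possible.
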
In this article, I explain ORPO and review its performance. I show how to use it to turn Mistral 7B into a chat model using consumer hardware.

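As a preview of the fine-tuning section, here is roughly what ORPO training looks like with TRL’s ORPOTrainer on a 4-bit quantized Mistral 7B with a LoRA adapter, so that it fits on consumer hardware. This is a minimal sketch, not the exact configuration used later: the preference dataset and hyperparameters below are placeholders, and argument names can vary between TRL versions.

```python
# Minimal ORPO sketch: 4-bit base model + LoRA adapter, trained directly on
# preference pairs (no separate SFT step, no reference model).
# The dataset and hyperparameters are placeholders, not the exact setup used later.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 4-bit (NF4) to fit on a consumer GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map={"": 0},
)

# Preference dataset with "prompt", "chosen", and "rejected" columns
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

orpo_args = ORPOConfig(
    output_dir="./mistral-7b-orpo",
    beta=0.1,                        # weight of the odds-ratio term (lambda in the paper)
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    max_length=1024,
    max_prompt_length=512,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"),
)
trainer.train()
```

Note that this single ORPOTrainer run replaces both the SFT and the DPO runs of the earlier sketch.
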
ORPO is presented in this paper:

ORPO: Monolithic Preference Optimization without Reference Model
