ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step


A less expensive alignment method performing as well as DPO


There are many methods to align large language models (LLMs) with human preferences. Reinforcement learning with human feedback (RLHF) was one of the first and brought us ChatGPT, but RLHF is very costly. DPO, IPO, and KTO are notably cheaper than RLHF as they don't need a reward model.

While DPO and IPO are cheaper, they still require training two different models: one model for the supervised fine-tuning (SFT) step, i.e., training the model to answer instructions, and then a second model aligned with human preferences, using the SFT model for initialization and as a reference.

ORPO is yet another new method for LLM alignment, but this one doesn't even need the SFT model. With ORPO, the LLM jointly learns to answer instructions and to align with human preferences.
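To make this joint learning more concrete, here is a minimal PyTorch sketch of the ORPO objective as I read it from the paper: a standard negative log-likelihood on the chosen response plus an odds-ratio term that pushes the odds of the chosen response above the odds of the rejected one. The function name `orpo_loss` and the `lam` weight (the paper's λ) are illustrative, not taken from any particular library, and the inputs are assumed to be length-normalized log-probabilities from a forward pass of the model.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """Illustrative sketch of the ORPO objective.

    chosen_logps / rejected_logps: length-normalized (average per-token)
    log-probabilities of the chosen and rejected responses, shape (batch,).
    lam: weight of the odds-ratio term relative to the SFT loss.
    """
    # SFT part: negative log-likelihood of the chosen response
    sft_loss = -chosen_logps.mean()

    # Odds of a response: p / (1 - p), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio part: reward a higher odds for the chosen response
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

    return sft_loss + lam * or_loss
```

Because the same chosen log-probabilities feed both terms, a single forward pass per preference pair is enough: there is no separate SFT model and no frozen reference model to keep in memory.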

In this article, I explain ORPO and review its performance. I show how to use it to turn Mistral 7B into a chat model using consumer hardware.

ORPO is presented in this paper:

ORPO: Monolithic Preference Optimization without Reference Model
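To give an idea of what the fine-tuning setup can look like in practice, here is a minimal sketch using Hugging Face TRL's ORPOTrainer with 4-bit quantization and a LoRA adapter, the kind of configuration that fits Mistral 7B on a single consumer GPU. The dataset, LoRA targets, and hyperparameters below are illustrative assumptions, not the exact recipe used later in the article, and some argument names may differ depending on your TRL version.

```python
# A minimal sketch, assuming TRL (with ORPOTrainer/ORPOConfig), peft,
# and bitsandbytes are installed.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"

# 4-bit quantization so the 7B model fits on a consumer GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapter: only a small fraction of the weights is trained
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# A preference dataset with prompt/chosen/rejected fields (it may need
# reformatting into plain text depending on the TRL version)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

orpo_args = ORPOConfig(
    output_dir="./mistral-7b-orpo",
    beta=0.1,                    # weight of the odds-ratio term
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset,
    tokenizer=tokenizer,         # named "processing_class" in recent TRL versions
    peft_config=peft_config,
)
trainer.train()
```

Note that, unlike a DPO setup, there is no `ref_model` argument and no prior SFT run: the trainer optimizes instruction following and preference alignment in the same pass.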
