Training models to understand and predict human preferences can be incredibly complex. Traditional methods, like supervised fine-tuning, often require assigning specific labels to data, which is not cost-efficient, especially for nuanced tasks. Preference optimization is an alternative approach that can simplify this process and yield more accurate results. By comparing and ranking candidate answers rather than assigning fixed labels, preference optimization allows models to capture the subtleties of human judgment more effectively.
Preference optimization is widely used for fine-tuning language models, but it can also be applied to vision language models (VLMs).
We’re excited to announce that the TRL library now supports direct preference optimization (DPO) for VLMs. This article will guide you through the process of training VLMs using TRL and DPO.
Preference dataset
Preference optimization requires data that captures user preferences. In the binary choice setting, each example consists of a prompt and two candidate answers: one that is chosen and one that is rejected. The model’s goal is to learn to predict the chosen answer over the rejected one.
For example, you need samples like the following:

❔ Question: How many families?
- ❌ Rejected: The image does not provide any information about families.
- ✅ Chosen: The image shows a Union Organization table setup with 18,000 families.
Note that the chosen answer is not necessarily correct. For example, the chosen response that says 18,000 families is still wrong, but it is less wrong than the rejected response.
For this blog post, we’ll be using the openbmb/RLAIF-V-Dataset, which contains over 83,000 annotated rows. Let’s take a closer look at the dataset:
>>> from datasets import load_dataset
>>> dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train[:1%]")
>>> sample = dataset[1]
>>> sample["image"].show()
>>> sample["question"]
'how many families?'
>>> sample["rejected"]
'The image does not provide any information about families.'
>>> sample["chosen"]
'The image shows a Union Organization table setup with 18,000 families.'
Our model requires both text and images as input, so the first step is to format the dataset to fit this requirement. The data needs to be structured to simulate a conversation between a user and an assistant. The user provides a prompt that includes an image and a question, while the assistant responds with an answer. Here’s how this formatting is done:
from datasets import features
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

def format(example):
    # Build the conversation: the user turn contains the image and the question,
    # and each candidate answer becomes an assistant turn.
    prompt = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": example["question"]}],
        },
    ]
    chosen = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["chosen"]}],
        },
    ]
    rejected = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["rejected"]}],
        },
    ]
    # Apply the chat template to get plain-text prompt and answers
    prompt = processor.apply_chat_template(prompt, tokenize=False)
    chosen = processor.apply_chat_template(chosen, tokenize=False)
    rejected = processor.apply_chat_template(rejected, tokenize=False)
    # Resize the image so it doesn't exceed the processor's maximum edge size
    max_size = processor.image_processor.size["longest_edge"]
    example["image"].thumbnail((max_size, max_size))
    return {"images": [example["image"]], "prompt": prompt, "chosen": chosen, "rejected": rejected}

dataset = dataset.map(format, remove_columns=dataset.column_names)

# Make sure the images are encoded as images (and not as raw bytes)
f = dataset.features
f["images"] = features.Sequence(features.Image(decode=True))
dataset = dataset.cast(f)
Our dataset is now formatted. Let’s have a look at the first example:
>>> dataset[1]
{'images': [<PIL.JpegImagePlugin.JpegImageFile image ... at 0x154505570>],
 'prompt': 'User:<image>how many families?<end_of_utterance>\n',
 'rejected': 'Assistant: The image does not provide any information about families.<end_of_utterance>\n',
 'chosen': 'Assistant: The image shows a Union Organization table setup with 18,000 families.<end_of_utterance>\n'}
Warm up your GPUs, the dataset is ready for training!
Training
For the sake of the example, we’ll be training the Idefics2-8b model, but note that the DPO implementation in TRL supports other models like Llava 1.5 and PaliGemma. More information in the section Finetuning Llava 1.5, PaliGemma and others. Before looking into the training process, we’ll first make sure everything fits smoothly into memory.
How much memory do I need?
I have a GPU with 80GB of VRAM. Is it enough to train my Idefics2-8b model? Here are the calculation steps to get a rough estimate of the memory needed.
Let \( N \) be the number of parameters and \( P \) the precision. The following components will have to fit in memory:
- Model to train: \( N \times P \)
- Reference model: the reference model is the same as the model to train, so it also requires \( N \times P \)
- Gradients: we train the whole model, and each parameter requires a gradient, so it requires \( N \times P \)
- Optimizer states: we use AdamW, which requires two states per parameter, so it requires \( 2 \times N \times P \)

Idefics2-8b has 8 billion parameters, and we use float32 precision, which requires 4 bytes per float. So the total memory required is:
| Component | Calculation | Memory |
|---|---|---|
| Model to train | \( 8 \times 10^9 \times 4 \) | 32 GB |
| Reference model | \( 8 \times 10^9 \times 4 \) | 32 GB |
| Gradients | \( 8 \times 10^9 \times 4 \) | 32 GB |
| Optimizer states | \( 2 \times 8 \times 10^9 \times 4 \) | 64 GB |
| Total | | 160 GB |
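The same arithmetic as a minimal Python sketch (the 8 billion parameters and 4 bytes per float32 value are the rough figures from the table above, so this is an estimate rather than a measurement):

```python
# Rough memory estimate for full DPO fine-tuning:
# model + reference model + gradients + AdamW optimizer states.
num_params = 8e9     # Idefics2-8b, approximately
bytes_per_param = 4  # float32

model = num_params * bytes_per_param                 # weights being trained
reference = num_params * bytes_per_param             # frozen reference model used by DPO
gradients = num_params * bytes_per_param             # one gradient per trained parameter
optimizer_states = 2 * num_params * bytes_per_param  # AdamW keeps two states per parameter

total_gb = (model + reference + gradients + optimizer_states) / 1e9
print(f"Estimated memory: {total_gb:.0f} GB")  # ~160 GB
```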
This is way above my GPU’s memory capacity. Fortunately, by applying techniques such as quantization and LoRA, we can significantly reduce the memory requirements and make the training feasible. Let’s see how to do that.
Quantization
Quantization is a way that reduces the precision of the model’s weights and activations. Switching from float32 to bfloat16 precision halves the storage requirement per parameter from 4 bytes to 2 bytes. This optimization conserves memory while also accelerating computations, ensuring high performance with minimal compromise.
To implement bfloat16 precision in the model:
import torch
from transformers import AutoModelForVision2Seq
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)
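As a quick sanity check of that halving, you can print the model’s reported footprint after loading it in bfloat16 (a minimal sketch; the exact number depends on the checkpoint):

```python
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)

# get_memory_footprint() reports the bytes used by the model's parameters and buffers;
# with bfloat16 (2 bytes per parameter) this should be roughly half the 32 GB float32 figure.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```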
bfloat16 precision can also be applied to the optimizer by setting bf16=True in the training arguments:
from transformers import TrainingArguments
training_args = TrainingArguments(..., bf16=True)
LoRA
LoRA is a technique that reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while keeping the original weights frozen. This significantly decreases the storage needs for LLMs adapted to specific tasks. LoRA is integrated in PEFT and you can set it up in no time:
from transformers import AutoModelForVision2Seq
+ from peft import get_peft_model, LoraConfig
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
+ peft_config = LoraConfig(target_modules="all-linear")
+ model = get_peft_model(model, peft_config)
PEFT acts as a wrapper (called an adapter) around the model. The adapter is what gets trained while the inner model is kept frozen. How much does LoRA reduce the number of trainable parameters?
>>> model.print_trainable_parameters()
trainable params: 55,348,736 || all params: 8,458,116,848 || trainable%: 0.6543860411799315
It reduces the number of trainable parameters from 8 billion to 55 million, which is a huge gap, and it will significantly reduce the memory requirements.
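To see where this reduction comes from, here is the generic LoRA decomposition (standard LoRA math, not Idefics2-specific layer shapes). For a frozen weight matrix \( W \in \mathbb{R}^{d \times k} \), LoRA learns a low-rank update

\[
W' = W + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k),
\]

so each adapted linear layer trains only \( r(d + k) \) parameters instead of \( d \times k \). Summed over all the layers matched by `target_modules="all-linear"`, this is what brings the trainable count down to the roughly 55 million parameters reported above.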
The new memory requirements after quantization and LoRA
Now that we have reduced the memory requirements, let’s recalculate the memory needed:
| Component | Calculation | Memory |
|---|---|---|
| Model to train | \( 8 \times 10^9 \times 2 \) | 16 GB |
| Reference model | \( 8 \times 10^9 \times 2 \) | 16 GB |
| Gradients | \( 55 \times 10^6 \times 2 \) | 0.1 GB |
| Optimizer states | \( 2 \times 55 \times 10^6 \times 2 \) | 0.2 GB |
| Total | | 32.3 GB |
This time, we need around 32GB of memory to finetune our Idefics2-8b model, which is much more reasonable and fits within my GPU!
For additional information on optimizing memory usage with LoRA and QLoRA, refer to the PEFT documentation or Google’s LoRA and QLoRA recommendations for LLMs.
What about the batch size?
Our memory calculation isn’t exact because it doesn’t account for activations. Activations are the intermediate outputs of the network layers, and their memory requirements depend on the model structure and batch size. Precisely calculating the memory needed for activations is difficult, so we’ll rely on empirical observations.
To choose an appropriate training batch size (per_device_train_batch_size), start with your desired batch size (e.g., 64). This will likely result in an out-of-memory (OOM) error. If it does, reduce the batch size by half and double the gradient accumulation steps (gradient_accumulation_steps) to maintain the same effective batch size. Repeat this process until the memory fits within your GPU. In our case, we end up with a batch size of 2 and 32 gradient accumulation steps.
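Here is a minimal sketch of that halving/doubling rule (the starting batch size of 64 and the final values of 2 and 32 are the ones used above; whether a given setting fits is something you check empirically by launching a run):

```python
# The effective batch size is per_device_train_batch_size * gradient_accumulation_steps.
per_device_train_batch_size = 64
gradient_accumulation_steps = 1

# Halve the per-device batch size and double the accumulation steps until training fits in memory.
# (In practice, "fits" means a training run no longer hits an OOM error.)
while per_device_train_batch_size > 2:  # 2 is what ended up fitting on our 80GB GPU
    per_device_train_batch_size //= 2
    gradient_accumulation_steps *= 2

print(per_device_train_batch_size, gradient_accumulation_steps)  # 2 32
print("Effective batch size:", per_device_train_batch_size * gradient_accumulation_steps)  # 64
```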
An additional optimization is to use gradient checkpointing (gradient_checkpointing) to reduce the memory needed for activations. This technique trades off compute for memory by recomputing parts of the network during the backward pass. It can be enabled by setting gradient_checkpointing=True in the training arguments.
Summary: complete training script
Now that we have set up the model, dataset, and training parameters, we’re ready to train. Here’s how to put everything together in a script, including some additional elements to speed up processing, like dataset_num_proc and dataloader_num_workers:
# dpo_idefics2-8b.py
from datasets import features, load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

def main():
    # Load the model and processor
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

    # Load the dataset
    dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train")

    def format(example):
        # Build the conversation and apply the chat template
        prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": example["question"]}]}]
        chosen = [{"role": "assistant", "content": [{"type": "text", "text": example["chosen"]}]}]
        rejected = [{"role": "assistant", "content": [{"type": "text", "text": example["rejected"]}]}]
        prompt = processor.apply_chat_template(prompt, tokenize=False)
        chosen = processor.apply_chat_template(chosen, tokenize=False)
        rejected = processor.apply_chat_template(rejected, tokenize=False)
        # Resize the image so it doesn't exceed the maximum size
        max_size = processor.image_processor.size["longest_edge"] // 2
        example["image"].thumbnail((max_size, max_size))
        return {"images": [example["image"]], "prompt": prompt, "chosen": chosen, "rejected": rejected}

    # Apply the formatting function to the dataset
    dataset = dataset.map(format, remove_columns=dataset.column_names, num_proc=32)

    # Make sure the images are encoded as images (and not as raw bytes)
    f = dataset.features
    f["images"] = features.Sequence(features.Image(decode=True))
    dataset = dataset.cast(f)

    # Set up the training arguments and the trainer
    training_args = DPOConfig(
        output_dir="idefics2-8b-dpo",
        bf16=True,
        gradient_checkpointing=True,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=32,
        num_train_epochs=1,
        dataset_num_proc=32,  # tokenization uses 32 processes
        dataloader_num_workers=32,  # data loading uses 32 workers
        logging_steps=10,
    )
    trainer = DPOTrainer(
        model,
        ref_model=None,  # not needed when using PEFT
        args=training_args,
        train_dataset=dataset,
        tokenizer=processor,
        peft_config=LoraConfig(target_modules="all-linear"),
    )

    trainer.train()

if __name__ == "__main__":
    main()
Let’s run and wait… 🚀
accelerate launch dpo_idefics2-8b.py
Results
A few hours later, the training is complete. Let’s take a look at the training curves:
In DPO, we focus on several metrics to assess the quality of the training:
- Accuracy: This metric indicates the percentage of training samples where the model is more likely to output the chosen answer rather than the rejected answer. We can see an increase in accuracy, which is a positive sign.
- Rewards: Rewards are related to the probability of an answer being chosen; for more details, refer to the DPO paper, Section 5. We expect the reward for the chosen answer to be higher than for the rejected answer. To verify this, we look at the reward margin, which is the difference between the rewards for the chosen and rejected answers. An increasing reward margin, as observed here, is also a good sign (see the formulas below).
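For reference, here is how these quantities are defined in DPO (this is the standard formulation from the DPO paper; \( \pi_\theta \) is the policy being trained, \( \pi_{\mathrm{ref}} \) the frozen reference model, and \( \beta \) the DPO temperature):

\[
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\text{reward margin} = r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}}).
\]

The accuracy reported above is the fraction of training samples for which this margin is positive.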
Evaluation
Inference
With the model training complete, the next step is to evaluate its performance on some examples. This will give us a sense of how well the model has learned and how effectively it can make predictions. Here’s a script to help you evaluate the model and analyze its performance on a set of test examples:
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b").to("cuda")
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model.load_adapter("HuggingFaceH4/idefics2-8b-dpo-rlaif-v-v0.3")  # load the DPO LoRA adapter

# Build the prompt from your own question and image
user_message = ...
image_path = ...
data = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": user_message}]}]
prompts = processor.apply_chat_template(data, add_generation_prompt=True)  # append the assistant prefix so the model answers
images = [Image.open(image_path)]
inputs = processor(prompts, images, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate and decode the answer
generated_ids = model.generate(**inputs, max_new_tokens=500)
response_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response_text)
As mentioned above, the openbmb/RLAIF-V-Dataset is designed to reduce hallucinations. But has the fine-tuning actually reduced hallucinations? To find out, we can use the AMBER benchmark, a dataset specifically created to evaluate hallucinations in VLMs. We report the results for Idefics2 and Idefics2+DPO on the discriminative task and compare them with other models for reference.
| Model | Accuracy | F1 |
|---|---|---|
| GPT-4o | 88.8 | 91.6 |
| Idefics2+DPO | 85.9 | 89.4 |
| Idefics2 | 85.8 | 89.1 |
| GPT-4v | 83.4 | 87.4 |
| MiniGemini | 82.6 | 87.6 |
| LLaVA-NeXT | 81.4 | 85.4 |
| QWEN-VL | 81.9 | 86.4 |
| LURE | 73.5 | 77.7 |
| OPERA | 75.2 | 78.3 |
| Less-is-more | 72.4 | 75.8 |
| VCD | 71.8 | 74.9 |
Overall, the fine-tuned model seems to hallucinate a bit less. The training seems to have been successful!
Here are some cherry-picked examples to illustrate the model’s performance:
| Image | Question | Idefics2 | Idefics2+DPO |
|---|---|---|---|
| ![]() | Are there two ships in this image? | Yes | No |
| ![]() | Is the ground uneven in this image? | No | Yes |
| ![]() | Is there one shovel in this image? | Yes | No |
Try it yourself and see how the model performs on your own examples!
Finetuning Llava 1.5, PaliGemma and others
At the time of writing, the DPO implementation in TRL supports Idefics2, Llava 1.5, and PaliGemma, with ongoing efforts to add support for more models. The easiest way to fine-tune these models is to use the example script provided in the TRL repository. For example, to finetune PaliGemma, you can use the following command:
accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 32 \
    --dataset_num_proc 32 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16 \
    --gradient_checkpointing \
    --use_peft \
    --lora_target_modules=all-linear
You can find a detailed focus on PaliGemma finetuning in the smol-vision project.
🚀🚀 Now you have everything you need to start fine-tuning your own VLMs with DPO. Share your findings, models, and datasets with the community!




