Training models to understand and predict human preferences can be incredibly complex. Traditional methods, like supervised fine-tuning, often require assigning specific labels to data, which is not cost-efficient, especially for nuanced tasks. Preference optimization is an alternative approach that can simplify this process and yield more accurate results. By comparing and ranking candidate answers rather than assigning fixed labels, preference optimization allows models to capture the subtleties of human judgment more effectively.
Preference optimization is widely used for fine-tuning language models, but it can also be applied to vision language models (VLMs).
We’re excited to announce that the TRL library now supports direct preference optimization (DPO) for VLMs. This article will guide you through the process of training VLMs using TRL and DPO.
Preference dataset
Preference optimization requires data that captures user preferences. In the binary choice setting, each example consists of a prompt and two candidate answers: one that is chosen and one that is rejected. The model’s goal is to learn to predict the chosen answer over the rejected one.
For example, you need samples like the following:

❔ Question: How many families?
- ❌ Rejected: The image does not provide any information about families.
- ✅ Chosen: The image shows a Union Organization table setup with 18,000 families.
Note that the chosen answer is not necessarily correct. For example, the chosen response that says 18,000 families is still wrong, but it is less wrong than the rejected response.
For this blog post, we’ll be using the openbmb/RLAIF-V-Dataset, which contains over 83,000 annotated rows. Let’s take a closer look at the dataset:
>>> from datasets import load_dataset
>>> dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train[:1%]")
>>> sample = dataset[1]
>>> sample["image"].show()
>>> sample["question"]
'how many families?'
>>> sample["rejected"]
'The image does not provide any information about families.'
>>> sample["chosen"]
'The image shows a Union Organization table setup with 18,000 families.'
Our model requires both text and images as input, so the first step is to format the dataset to fit this requirement. The data needs to be structured to simulate a conversation between a user and an assistant. The user provides a prompt that includes an image and a question, while the assistant responds with an answer. Here’s how this formatting is done:
from datasets import features
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

def format(example):
    # Build the conversation: the user turn contains the image and the question,
    # and each candidate answer becomes an assistant turn.
    prompt = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": example["question"]}],
        },
    ]
    chosen = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["chosen"]}],
        },
    ]
    rejected = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["rejected"]}],
        },
    ]
    # Apply the chat template to get plain-text prompt and answers
    prompt = processor.apply_chat_template(prompt, tokenize=False)
    chosen = processor.apply_chat_template(chosen, tokenize=False)
    rejected = processor.apply_chat_template(rejected, tokenize=False)
    # Resize the image so it doesn't exceed the processor's maximum edge size
    max_size = processor.image_processor.size["longest_edge"]
    example["image"].thumbnail((max_size, max_size))
    return {"images": [example["image"]], "prompt": prompt, "chosen": chosen, "rejected": rejected}

dataset = dataset.map(format, remove_columns=dataset.column_names)

# Make sure the images are encoded as images (and not as raw bytes)
f = dataset.features
f["images"] = features.Sequence(features.Image(decode=True))
dataset = dataset.cast(f)
Our dataset is now formatted. Let’s have a look at the first example:
>>> dataset[1]
{'images': [<PIL.JpegImagePlugin.JpegImageFile image ... at 0x154505570>],
 'prompt': 'User:<image>how many families?<end_of_utterance>\n',
 'rejected': 'Assistant: The image does not provide any information about families.<end_of_utterance>\n',
 'chosen': 'Assistant: The image shows a Union Organization table setup with 18,000 families.<end_of_utterance>\n'}
Warm up your GPUs, the dataset is ready for training!
Training
For the sake of the example, we’ll be training the Idefics2-8b model, but note that the DPO implementation in TRL supports other models like Llava 1.5 and PaliGemma. More information in the section Finetuning Llava 1.5, PaliGemma and others. Before looking into the training process, we’ll first make sure everything fits smoothly into memory.
How much memory do I need?
I have a GPU with 80GB of VRAM. Is it enough to train my Idefics2-8b model? Here are the calculation steps to get a rough estimate of the memory needed.
Let \( N \) be the number of parameters and \( P \) the precision. The following components will have to fit in memory:
- Model to train: \( N \times P \)
- Reference model: the reference model is the same as the model to train, so it also requires \( N \times P \)
- Gradients: we train the whole model, and each parameter requires a gradient, so it requires \( N \times P \)
- Optimizer states: we use AdamW, which requires two states per parameter, so it requires \( 2 \times N \times P \)

Idefics2-8b has 8 billion parameters, and we use float32 precision, which requires 4 bytes per float. So the total memory required is:
| Component | Calculation | Memory |
|---|---|---|
| Model to train | \( 8 \times 10^9 \times 4 \) | 32 GB |
| Reference model | \( 8 \times 10^9 \times 4 \) | 32 GB |
| Gradients | \( 8 \times 10^9 \times 4 \) | 32 GB |
| Optimizer states | \( 2 \times 8 \times 10^9 \times 4 \) | 64 GB |
| Total | | 160 GB |
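The same arithmetic as a minimal Python sketch (the 8 billion parameters and 4 bytes per float32 value are the rough figures from the table above, so this is an estimate rather than a measurement):

```python
# Rough memory estimate for full DPO fine-tuning:
# model + reference model + gradients + AdamW optimizer states.
num_params = 8e9     # Idefics2-8b, approximately
bytes_per_param = 4  # float32

model = num_params * bytes_per_param                 # weights being trained
reference = num_params * bytes_per_param             # frozen reference model used by DPO
gradients = num_params * bytes_per_param             # one gradient per trained parameter
optimizer_states = 2 * num_params * bytes_per_param  # AdamW keeps two states per parameter

total_gb = (model + reference + gradients + optimizer_states) / 1e9
print(f"Estimated memory: {total_gb:.0f} GB")  # ~160 GB
```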
This is way above my GPU’s memory capacity. Fortunately, by applying techniques such as quantization and LoRA, we can significantly reduce the memory requirements and make the training feasible. Let’s see how to do that.
Quantization
Quantization is a way that reduces the precision of the model’s weights and activations. Switching from float32 to bfloat16 precision halves the storage requirement per parameter from 4 bytes to 2 bytes. This optimization conserves memory while also accelerating computations, ensuring high performance with minimal compromise.
To implement bfloat16 precision in the model:
import torch
from transformers import AutoModelForVision2Seq
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)
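As a quick sanity check of that halving, you can print the model’s reported footprint after loading it in bfloat16 (a minimal sketch; the exact number depends on the checkpoint):

```python
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)

# get_memory_footprint() reports the bytes used by the model's parameters and buffers;
# with bfloat16 (2 bytes per parameter) this should be roughly half the 32 GB float32 figure.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```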
bfloat16 precision can also be applied to the optimizer by setting bf16=True in the training arguments:
from transformers import TrainingArguments
training_args = TrainingArguments(..., bf16=True)
LoRA
LoRA is a technique that reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while keeping the original weights frozen. This significantly decreases the storage needs for LLMs adapted to specific tasks. LoRA is integrated in PEFT and you can set it up in no time:
from transformers import AutoModelForVision2Seq
+ from peft import get_peft_model, LoraConfig
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
+ peft_config = LoraConfig(target_modules="all-linear")
+ model = get_peft_model(model, peft_config)
PEFT acts as a wrapper (called an adapter) around the model. The adapter is what gets trained while the inner model is kept frozen. How much does LoRA reduce the number of trainable parameters?
>>> model.print_trainable_parameters()
trainable params: 55,348,736 || all params: 8,458,116,848 || trainable%: 0.6543860411799315
It reduces the number of trainable parameters from 8 billion to 55 million, which is a huge gap, and it will significantly reduce the memory requirements.
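To see where this reduction comes from, here is the generic LoRA decomposition (standard LoRA math, not Idefics2-specific layer shapes). For a frozen weight matrix \( W \in \mathbb{R}^{d \times k} \), LoRA learns a low-rank update

\[
W' = W + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k),
\]

so each adapted linear layer trains only \( r(d + k) \) parameters instead of \( d \times k \). Summed over all the layers matched by `target_modules="all-linear"`, this is what brings the trainable count down to the roughly 55 million parameters reported above.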
The new memory requirements after quantization and LoRA
Now that we have reduced the memory requirements, let’s recalculate the memory needed:
| Component | Calculation | Memory |
|---|---|---|
| Model to train | \( 8 \times 10^9 \times 2 \) | 16 GB |
| Reference model | \( 8 \times 10^9 \times 2 \) | 16 GB |
| Gradients | \( 55 \times 10^6 \times 2 \) | 0.1 GB |
| Optimizer states | \( 2 \times 55 \times 10^6 \times 2 \) | 0.2 GB |
| Total | | 32.3 GB |
This time, we need around 32GB of memory to finetune our Idefics2-8b model, which is much more reasonable and fits within my GPU!
For additional information on optimizing memory usage with LoRA and QLoRA, refer to the PEFT documentation or Google’s LoRA and QLoRA recommendations for LLMs.
What about the batch size?
Our memory calculation isn’t exact because it doesn’t account for activations. Activations are the intermediate outputs of the network layers, and their memory requirements depend on the model structure and batch size. Precisely calculating the memory needed for activations is difficult, so we’ll rely on empirical observations.
To choose an appropriate training batch size (per_device_train_batch_size), start with your desired batch size (e.g., 64). This will likely result in an out-of-memory (OOM) error. If it does, reduce the batch size by half and double the gradient accumulation steps (gradient_accumulation_steps) to maintain the same effective batch size. Repeat this process until the memory fits within your GPU. In our case, we end up with a batch size of 2 and 32 gradient accumulation steps.
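Here is a minimal sketch of that halving/doubling rule (the starting batch size of 64 and the final values of 2 and 32 are the ones used above; whether a given setting fits is something you check empirically by launching a run):

```python
# The effective batch size is per_device_train_batch_size * gradient_accumulation_steps.
per_device_train_batch_size = 64
gradient_accumulation_steps = 1

# Halve the per-device batch size and double the accumulation steps until training fits in memory.
# (In practice, "fits" means a training run no longer hits an OOM error.)
while per_device_train_batch_size > 2:  # 2 is what ended up fitting on our 80GB GPU
    per_device_train_batch_size //= 2
    gradient_accumulation_steps *= 2

print(per_device_train_batch_size, gradient_accumulation_steps)  # 2 32
print("Effective batch size:", per_device_train_batch_size * gradient_accumulation_steps)  # 64
```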
An additional optimization is to use gradient checkpointing (gradient_checkpointing) to reduce the memory needed for activations. This technique trades off compute for memory by recomputing parts of the network during the backward pass. It can be enabled by setting gradient_checkpointing=True in the training arguments.
Summary: complete training script
Now that we have set up the model, dataset, and training parameters, we’re ready to train. Here’s how to put everything together in a script, including some additional elements to speed up processing, like dataset_num_proc and dataloader_num_workers:
# dpo_idefics2-8b.py
from datasets import features, load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

def main():
    # Load the model and processor
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

    # Load the dataset
    dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train")

    def format(example):
        # Build the conversation and apply the chat template
        prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": example["question"]}]}]
        chosen = [{"role": "assistant", "content": [{"type": "text", "text": example["chosen"]}]}]
        rejected = [{"role": "assistant", "content": [{"type": "text", "text": example["rejected"]}]}]
        prompt = processor.apply_chat_template(prompt, tokenize=False)
        chosen = processor.apply_chat_template(chosen, tokenize=False)
        rejected = processor.apply_chat_template(rejected, tokenize=False)
        # Resize the image so it doesn't exceed the maximum size
        max_size = processor.image_processor.size["longest_edge"] // 2
        example["image"].thumbnail((max_size, max_size))
        return {"images": [example["image"]], "prompt": prompt, "chosen": chosen, "rejected": rejected}

    # Apply the formatting function to the dataset
    dataset = dataset.map(format, remove_columns=dataset.column_names, num_proc=32)

    # Make sure the images are encoded as images (and not as raw bytes)
    f = dataset.features
    f["images"] = features.Sequence(features.Image(decode=True))
    dataset = dataset.cast(f)

    # Set up the training arguments and the trainer
    training_args = DPOConfig(
        output_dir="idefics2-8b-dpo",
        bf16=True,
        gradient_checkpointing=True,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=32,
        num_train_epochs=1,
        dataset_num_proc=32,  # tokenization uses 32 processes
        dataloader_num_workers=32,  # data loading uses 32 workers
        logging_steps=10,
    )
    trainer = DPOTrainer(
        model,
        ref_model=None,  # not needed when using PEFT
        args=training_args,
        train_dataset=dataset,
        tokenizer=processor,
        peft_config=LoraConfig(target_modules="all-linear"),
    )

    trainer.train()

if __name__ == "__main__":
    main()
Let’s run and wait… 🚀
accelerate launch dpo_idefics2-8b.py
Results
A few hours later, the training is complete. Let’s take a look at the training curves:
In DPO, we focus on several metrics to assess the quality of the training:
- Accuracy: This metric indicates the percentage of training samples where the model is more likely to output the chosen answer rather than the rejected answer. We can see an increase in accuracy, which is a positive sign.
- Rewards: Rewards are related to the probability of an answer being chosen; for more details, refer to the DPO paper, Section 5. We expect the reward for the chosen answer to be higher than for the rejected answer. To verify this, we look at the reward margin, which is the difference between the rewards for the chosen and rejected answers. An increasing reward margin, as observed here, is also a good sign (see the formulas below).
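For reference, here is how these quantities are defined in DPO (this is the standard formulation from the DPO paper; \( \pi_\theta \) is the policy being trained, \( \pi_{\mathrm{ref}} \) the frozen reference model, and \( \beta \) the DPO temperature):

\[
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\text{reward margin} = r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}}).
\]

The accuracy reported above is the fraction of training samples for which this margin is positive.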
Evaluation
Inference
With the model training complete, the next step is to evaluate its performance on some examples. This will give us a sense of how well the model has learned and how effectively it can make predictions. Here’s a script to help you evaluate the model and analyze its performance on a set of test examples:
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b").to("cuda")
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model.load_adapter("HuggingFaceH4/idefics2-8b-dpo-rlaif-v-v0.3")  # load the DPO LoRA adapter

# Build the prompt from your own question and image
user_message = ...
image_path = ...
data = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": user_message}]}]
prompts = processor.apply_chat_template(data, add_generation_prompt=True)  # append the assistant prefix so the model answers
images = [Image.open(image_path)]
inputs = processor(prompts, images, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate and decode the answer
generated_ids = model.generate(**inputs, max_new_tokens=500)
response_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response_text)
As mentioned above, the openbmb/RLAIF-V-Dataset is designed to reduce hallucinations. But has the fine-tuning actually reduced hallucinations? To find out, we can use the AMBER benchmark, a dataset specifically created to evaluate hallucinations in VLMs. We report the results for Idefics2 and Idefics2+DPO on the discriminative task and compare them with other models for reference.
| Model | Accuracy | F1 |
|---|---|---|
| GPT-4o | 88.8 | 91.6 |
| Idefics2+DPO | 85.9 | 89.4 |
| Idefics2 | 85.8 | 89.1 |
| GPT-4v | 83.4 | 87.4 |
| MiniGemini | 82.6 | 87.6 |
| LLaVA-NeXT | 81.4 | 85.4 |
| QWEN-VL | 81.9 | 86.4 |
| LURE | 73.5 | 77.7 |
| OPERA | 75.2 | 78.3 |
| Less-is-more | 72.4 | 75.8 |
| VCD | 71.8 | 74.9 |
Overall, the fine-tuned model seems to hallucinate a bit less. The training seems to have been successful!
Here are some cherry-picked examples to illustrate the model’s performance:
| Image | Question | Idefics2 | Idefics2+DPO |
|---|---|---|---|
| ![]() | Are there two ships in this image? | Yes | No |
| ![]() | Is the ground uneven in this image? | No | Yes |
| ![]() | Is there one shovel in this image? | Yes | No |
Try it yourself and see how the model performs on your own examples!
Finetuning Llava 1.5, PaliGemma and others
At the time of writing, the DPO implementation in TRL supports Idefics2, Llava 1.5, and PaliGemma, with ongoing efforts to add support for more models. The easiest way to fine-tune these models is to use the example script provided in the TRL repository. For example, to finetune PaliGemma, you can use the following command:
accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 32 \
    --dataset_num_proc 32 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16 \
    --gradient_checkpointing \
    --use_peft \
    --lora_target_modules=all-linear
You can find a detailed focus on PaliGemma finetuning in the smol-vision project.
🚀🚀 Now you have everything you need to start fine-tuning your own VLMs with DPO. Share your findings, models, and datasets with the community!




