Vision Language Model Alignment in TRL ⚡️



Vision Language Models (VLMs) are getting stronger, but aligning them to human preferences still matters. In TRL, we already showed how to post-train VLMs with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). This time, we're going further.

tl;dr: Here's what's new in TRL:

  • Mixed Preference Optimization (MPO)
  • Group Relative Policy Optimization (GRPO)
  • Group Sequence Policy Optimization (GSPO) (a variant of GRPO)

These go beyond pairwise DPO, extracting richer signals from preference data and scaling better with modern VLMs.

We've also extended existing methods to support VLMs:

  • Reinforce Leave One Out (RLOO)
  • Online Direct Preference Optimization (Online DPO)

This allows more efficient and scalable multimodal alignment.

Finally:

  • Native Supervised Fine-Tuning (SFT) support for Vision Language Models
  • Training scripts and demo notebooks to help you get started quickly






Alignment for Vision Language Models

Traditionally, you take a base model, apply SFT so it follows instructions, and then apply DPO to align it to preference data. Previously, we adapted this approach to Vision Language Models (VLMs) and validated it on IDEFICS2, showing improvement in model responses.

DPO works by optimizing preferences between pairs of model responses using a contrastive loss: you have a chosen and a rejected answer, and you optimize the model toward the former and away from the latter.
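For intuition, a single preference example for a VLM looks roughly like the following. This is an illustrative sketch: the field names follow TRL's conversational preference format with an images column, so check the dataset documentation for the exact schema.

from PIL import Image

# Illustrative preference pair for a VLM (sketch, not taken from a real dataset):
# an "images" column plus conversational "prompt", "chosen", and "rejected" columns.
example = {
    "images": [Image.new("RGB", (224, 224))],  # placeholder for the actual image
    "prompt": [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in the image?"},
        ]},
    ],
    "chosen": [
        {"role": "assistant", "content": [{"type": "text", "text": "A cat sitting on a red couch."}]},
    ],
    "rejected": [
        {"role": "assistant", "content": [{"type": "text", "text": "A dog running in a park."}]},
    ],
}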

But in the last year, new multimodal alignment methods such as GRPO and MPO have gained popularity and can push VLM performance even further. At the end of the blog post you'll find a table that showcases the differences between model responses.



Mixed Preference Optimization (MPO)

Aligning multimodal models with SFT to do reasoning tasks falls short due to distribution shift. Meanwhile, models aligned with DPO fail to generate coherent rationales and can produce repetitive responses. To address this, there's a new technique called Mixed Preference Optimization (MPO), made specifically for multimodal models. This method is essentially an extension of DPO with multiple losses: the preference loss from DPO (sigmoid), the quality loss from Binary Classifier Optimization (BCO), and the generation loss from SFT. According to the paper, simply switching to this combined loss results in a 6.2-point improvement on MathVista!
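Conceptually, the combined objective is just a weighted sum of the three losses. Here's a minimal sketch of that idea; the weights mirror the DPOConfig example below, and this is not TRL's internal implementation:

# Minimal sketch of the MPO objective: a weighted sum of the three losses.
# The weights mirror the DPOConfig example below; this is not TRL's internal code.
def mpo_loss(dpo_sigmoid_loss, bco_pair_loss, sft_loss, weights=(0.8, 0.2, 1.0)):
    w_pref, w_quality, w_gen = weights
    return w_pref * dpo_sigmoid_loss + w_quality * bco_pair_loss + w_gen * sft_loss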

Figure: MPO

Since this only modifies the loss, we added combined-loss support to TRL's DPOTrainer class. To use it, initialize the DPOConfig as follows:

mpo_config = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # DPO (preference), BCO (quality), SFT (generation) losses
    loss_weights=[0.8, 0.2, 1.0],  # weight of each loss, in the same order
)

Then initialize the DPOTrainer:

mpo_trainer = DPOTrainer(
    model=model_id,
    args=mpo_config,
    processing_class=tokenizer,
    train_dataset=dataset,
)
mpo_trainer.train()

And that's it! If you'd like to explore further, you can find a complete notebook example here.



Multimodal Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is a cutting-edge alignment method initially introduced in the DeepSeek Math paper and later integrated into DeepSeek R1, the groundbreaking LLM. It builds on PPO by computing policy updates over groups (batches of trajectories that represent how a dialogue rolls out). This makes it more robust to reward noise, since the noise averages out within groups. Because the model learns a broader sense of what a good response looks like rather than chasing singular high-reward samples, this method also makes the model highly performant.
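The group-relative idea can be sketched in a few lines: each completion's reward is normalized against the other completions generated for the same prompt. This is a conceptual illustration only, not TRL's implementation:

# Conceptual sketch of group-relative advantages (not TRL's implementation):
# rewards for completions of the same prompt are normalized within the group,
# so per-sample reward noise largely cancels out.
def group_relative_advantages(rewards, eps=1e-4):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))  # above-average rewards get positive advantages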

Figure: GRPO

In TRL, we now introduce GRPO support for vision language models. We won't provide a full training script example here, as you can find it in the notebook. Instead, we'll focus on highlighting the key components and ideas.

To make the training script work effectively, we need to validate that the format of the answer is correct and that the solution itself is close to the ground truth, so we write two reward functions. To really see improvements in the latter reward, you would need a rather maximalist setup, with relatively larger models, plenty of generations, and a high-quality, diverse dataset.

import re
from math_verify import LatexExtractionConfig, parse, verify

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]

def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is identical as the bottom truth."""
    solutions = kwargs['solution']
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        answer_parsed = parse(content, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                rewards.append(0.0)
        else:
            # if the gold solution cannot be parsed, skip the example by giving full reward
            rewards.append(1.0)
    return rewards

Then you can initialize GRPOConfig and GRPOTrainer, pass in the reward functions we defined above, and call train() to start training.

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=1e-5,
    max_prompt_length=None,
    ... 
)
trainer = GRPOTrainer(
    model=model_id,
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

Explore the full notebook example here.



Group Sequence Policy Optimization (GSPO)

Group Sequence Policy Optimization (GSPO) is an RL alignment algorithm recently released by Qwen that overcomes some limitations of GRPO. It achieves more stable training by computing importance-sampling weights at the sequence level instead of per token. Its advantages are most relevant for MoE-style models.
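The key difference can be sketched as follows: where GRPO keeps one importance ratio per token, GSPO collapses them into a single, length-normalized ratio per sequence. A conceptual illustration (not TRL's implementation):

import math

# Conceptual sketch (not TRL's implementation): the sequence-level importance
# ratio is the length-normalized product (geometric mean) of the per-token ratios,
# computed here from per-token log-probabilities under the new and old policies.
def sequence_importance_ratio(new_logps, old_logps):
    diffs = [new - old for new, old in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))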

The latest TRL also introduces support for GSPO, and since it's a variant of GRPO's loss, it comes with multimodal support. To create the trainer, the process is the same as with GRPO, but with the following extra params added (values are taken from the paper).

from trl import GRPOConfig

training_args = GRPOConfig(
    ...
    importance_sampling_level="sequence",
    epsilon=3e-4,
    epsilon_high=4e-4,
    beta=0.0,
    loss_type="grpo",
    steps_per_generation=4,
)



Comparison

Here's a table summarizing model outputs for Qwen2.5-VL-3B fine-tuned with the techniques discussed above. Note that we only did minimal runs on dataset subsets, and the models were fine-tuned on different datasets, so the comparison is meant as a vibe check.

Check out the answers here


Query: In the given figure, if the angle APO is 25.0 degrees, what is the measure of angle AOB? Choices: A: 140° B: 130° C: 120° D: 110°

Answer in the Dataset: To find the measure of angle AOB, we can use the property that the sum of the interior angles of a quadrilateral is equal to 360 degrees. Applying this property, we can calculate angle AOB as 360 – angle OAP – angle OBP – angle P. Since angle OAP and angle OBP are both 90 degrees, and angle P is given as 50 degrees, we can substitute these values into the equation to get angle AOB = 360 – 90 – 90 – 50 = 130 degrees. Therefore, the measure of angle AOB is 130 degrees. Therefore, the answer is option B. Answer:B

Base Model Output: To find the measure of angle AOB, we need to use the properties of angles subtended by the same arc in a circle. Specifically, the angle subtended by an arc at the center of the circle (angle AOB) is twice the angle subtended by the same arc at any point on the circumference (angle APO). Given that angle APO = 25°, we can calculate angle AOB as follows: Since angle AOB is twice angle APO, we have: [ Angle AOB = 2 * Angle APO ] [ Angle AOB = 2 * 25° ] [ Angle AOB = 50° ] However, this calculation seems incorrect based on the provided choices. Let's re-evaluate using the correct property. The correct property states that the angle subtended by an arc at the center of the circle is twice the angle subtended by the same arc at any point on the circumference. Therefore, the correct calculation should be: Angle AOB = 2 * Angle APO = 2 * 25° = 50°. But since this doesn't match the given choices, let's consider another approach. If angle APO is 25°, and considering the properties of cyclic quadrilaterals or the fact that the sum of opposite angles in a cyclic quadrilateral is 180°, we might need to look at the supplementary angles formed by the arc and the chord. Let's assume the correct interpretation involves finding the supplementary angle to the given angle APO, which would give us the measure of angle AOB. Angle AOB = 180° – Angle APO = 180° – 25° = 155°. This still doesn't match the choices. Given the choices provided, it seems there might be a misunderstanding or a different interpretation needed. However, based on standard geometric principles, the correct answer should be: Angle AOB = 2 * Angle APO = 2 * 25° = 50°. If this is not the case, please provide additional context or clarify the problem statement.

MPO Output: To find the measure of angle AOB, we need to use the properties of angles subtended by the same arc in a circle. Specifically, the angle subtended by an arc at the center of the circle (angle AOB) is twice the angle subtended by the same arc at any point on the circumference (angle APO). Given that angle APO = 25°, we can calculate angle AOB as follows: Since angle AOB is twice the angle APO, we have: [ Angle AOB = 2 * Angle APO ] [ Angle AOB = 2 * 25° ] [ Angle AOB = 50° ] However, this calculation seems incorrect based on the provided choices. Let's re-evaluate using the correct property. The correct property states that the angle subtended by an arc at the center of the circle is twice the angle subtended by the same arc at any point on the circumference. Therefore, the correct calculation should be: Angle AOB = 2 * Angle APO = 2 * 25° = 50°. But since this doesn't match the given choices, let's consider another approach. If we look at the triangle OAP, angle AOP is 90° because it is a right angle (as it is inscribed in a semicircle). Now, angle APO is given as 25°, so angle AOA would be 65° (as it is supplementary to angle APO). Therefore, angle AOB would be half of this, which is 32.5°. However, this also doesn't match the choices. Let's consider the possibility that the problem might involve a different interpretation or a different geometric configuration. Given the choices, let's assume there might be a mistake in the problem statement or the choices themselves. If we were to choose from the given options, none of them seem to fit correctly based on standard geometric principles for this type of problem.

GRPO Output: To find the measure of angle AOB, we need to understand that angle AOB is an inscribed angle that intercepts the same arc as the central angle AOP. According to the Inscribed Angle Theorem, the measure of an inscribed angle is half the measure of the central angle that subtends the same arc. Given: Angle APO = 25.0 degrees. Since angle AOB is an inscribed angle intercepting the same arc as the central angle AOP, we can use the Inscribed Angle Theorem: the measure of angle AOB is half the measure of angle AOP. Therefore, the measure of angle AOB is 130 degrees.

GSPO Output: To find the measure of angle AOB, we need to understand that angle AOB is an inscribed angle that intercepts the same arc as the central angle AOP. According to the Inscribed Angle Theorem, the measure of an inscribed angle is half the measure of the central angle that subtends the same arc. Given: Angle APO = 25.0 degrees. Since angle AOB is an inscribed angle that intercepts the same arc as the central angle AOP, we can use the Inscribed Angle Theorem to find the measure of angle AOB: the measure of angle AOB is half the measure of angle AOP. Therefore, the answer is B: 130°.



Further Extensions for VLMs

Along with MPO, GRPO, and GSPO, TRL now supports Reinforce Leave One Out (RLOO) and Online Direct Preference Optimization (Online DPO) for Vision Language Models (VLMs), enabling alignment on multimodal datasets.



Reinforce Leave One Out (RLOO)

RLOO now supports common VLMs. You can find a complete training example in the rloo_vlm.py script.
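As a refresher, RLOO baselines each sampled completion against the mean reward of the other completions generated for the same prompt. Here's a conceptual sketch of the leave-one-out baseline (an illustration, not TRL's implementation):

# Conceptual sketch of the leave-one-out baseline (not TRL's implementation):
# each completion's advantage is its reward minus the mean reward of the
# other completions sampled for the same prompt.
def leave_one_out_advantages(rewards):
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

print(leave_one_out_advantages([1.0, 0.0, 0.5, 1.0]))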

Here's how to set up an RLOOTrainer:

from trl import RLOOTrainer

# reward functions (think_format_reward, accuracy_reward) are defined in rloo_vlm.py
trainer = RLOOTrainer(
    model=model_name,
    args=training_args,
    reward_funcs=[think_format_reward, accuracy_reward],
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

And to launch training directly from the example script:

CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/rloo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct



Online Direct Preference Optimization (Online DPO)

Online DPO also supports VLMs. See the online_dpo_vlm.py script for a simple example.

To run the example script (the vLLM integration is discussed later):

CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/online_dpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct --use_vllm --vllm_mode server

These scripts are ready to run for VLM training; full parameter tuning is documented in the TRL docs: Online DPO Trainer and RLOO Trainer.



Native Supervised Fine-Tuning Support

Previously, SFTTrainer only partially supported vision language models, primarily due to the many differences across VLM implementations in the transformers API. With the standardization of the transformers API, we have shipped full support for vision language models. You can simply initialize SFTTrainer with a VLM.

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None), 
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()

To train a VLM, you need to provide a dataset with an additional images column containing the images to be processed. You can take a look at Dataset Formats — Vision Datasets for more information on what it should look like. A good example is LLaVA Instruct Mix.
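For illustration, a single row in such a dataset looks roughly like this. The field names follow TRL's conversational vision format; check the dataset documentation linked above for the exact schema.

from PIL import Image

# Illustrative vision SFT example (sketch): an "images" column plus a
# conversational "messages" column that references the image.
example = {
    "images": [Image.new("RGB", (224, 224))],  # placeholder for the real image
    "messages": [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "A plate of pasta on a wooden table."},
        ]},
    ],
}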

We also have a sft_vlm.py script that works out of the box for transformers vision language models.



vLLM Integration in TRL

vLLM is integrated in TRL to support online alignment methods where you need to generate samples during training. Running the example scripts like the following enables vLLM:

CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct --use_vllm --vllm_mode colocate

There are mainly two modes: colocate and server. colocate runs vLLM in the same process as the training loop, sharing the same GPU between training and generation by creating a vLLM LLM instance inside the GRPOTrainer. Meanwhile, server requires you to serve vLLM separately in a different process, which the trainer then queries. You can start this server with the command:

trl vllm-serve --model Qwen/Qwen2.5-VL-3B-Instruct --tensor-parallel-size 1 

You can then run the script as follows:

CUDA_VISIBLE_DEVICES=1,2 python3 examples/scripts/grpo_vlm.py --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct --use_vllm --vllm_mode server
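If you prefer to set this up in the config instead of via command-line flags, the relevant GRPOConfig fields look roughly like this (a minimal sketch; see the vLLM integration docs for the full list of options):

from trl import GRPOConfig

# Minimal sketch of enabling vLLM generation from the config rather than CLI flags.
training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="colocate",  # or "server" to hit a separately launched vLLM server
)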

Another tip: we have added support for using vLLM with the transformers backend in TRL. You can enable it when running a script with colocate or when serving the model by passing the --vllm_model_impl transformers flag.

You can read more about vLLM integration in TRL here.



Useful Resources

Below, you'll find a compilation of resources to explore the alignment of VLMs in detail. Enjoy!


