Vision Language Models (Better, Faster, Stronger)



Vision Language Models (VLMs) are the talk of the town. In a previous blog post (from April 2024), we covered VLMs in depth. A large chunk of it was about LLaVA, the first successful and easily reproducible open-source vision language model, along with tips on how to discover, evaluate, and fine-tune open models.

Since then, a lot has changed. Models have become smaller yet more powerful. We have seen the rise of new architectures and capabilities (reasoning, agency, long video understanding, etc.). In parallel, entirely new paradigms, such as multimodal Retrieval Augmented Generation (RAG) and multimodal agents, have taken shape.

In this blog post, we'll take a look back and unpack everything that happened with vision language models over the past year. You'll discover the key changes, emerging trends, and notable developments.

We highly recommend reading the first blog post if you want a good primer on how vision language models work.






Recent model trends

In this section, we'll take a look at the new kinds of VLMs. While some are brand new, others are improved versions of previous research.



Any-to-any models

Any-to-any models, as the name suggests, are models that can take in any modality and output any modality (image, text, audio). They do this by aligning the modalities, so that an input from one modality can be translated into another (e.g. the word “dog” can be associated with an image of a dog, or with the utterance of the word).

These models have multiple encoders (one for each modality) and then fuse the embeddings to create a shared representation space. The decoders (multiple or single) take the shared latent space as input and decode into the modality of choice. The earliest attempt to build an any-to-any model is Chameleon by Meta, which can take in image and text and output image and text. Meta didn't release the image generation capability of this model, so Alpha-VLLM released Lumina-mGPT, which builds image generation on top of Chameleon.
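To make the encoder → shared-space → decoder recipe concrete, here is a minimal, purely illustrative PyTorch sketch. It is not any particular model's architecture: every layer name, dimension, and the toy Transformer backbone are assumptions made for illustration only.

import torch
import torch.nn as nn

class ToySharedSpaceModel(nn.Module):
    """Illustrative any-to-any skeleton: per-modality encoders -> shared space -> decoder head."""
    def __init__(self, d_model=512, vocab_size=32000, image_dim=1024, audio_dim=768):
        super().__init__()
        # One encoder (here reduced to a projection) per input modality
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(image_dim, d_model)   # maps vision-encoder features into the shared space
        self.audio_proj = nn.Linear(audio_dim, d_model)   # maps audio-encoder features into the shared space
        # A single Transformer backbone over the shared sequence
        # (a real model would be an autoregressive decoder with causal masking)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)     # text head; other heads would decode other modalities

    def forward(self, text_ids, image_feats, audio_feats):
        # Project every modality into the same d_model space and concatenate along the sequence axis
        tokens = torch.cat(
            [self.image_proj(image_feats), self.audio_proj(audio_feats), self.text_embed(text_ids)],
            dim=1,
        )
        hidden = self.backbone(tokens)
        return self.lm_head(hidden)  # logits over the output vocabulary

model = ToySharedSpaceModel()
logits = model(
    text_ids=torch.randint(0, 32000, (1, 16)),
    image_feats=torch.randn(1, 64, 1024),   # e.g. 64 image patch embeddings
    audio_feats=torch.randn(1, 32, 768),    # e.g. 32 audio frame embeddings
)
print(logits.shape)  # torch.Size([1, 112, 32000])

Real any-to-any models differ in how they fuse modalities and what their decoders emit, but the core idea is the same: everything ends up as tokens in one shared space.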

The latest and most capable any-to-any model, Qwen 2.5 Omni (figure below), is a good example for understanding the architecture of an any-to-any model.

Qwen-Omni

Qwen2.5-Omni employs a novel “Thinker-Talker” architecture, where the “Thinker” handles text generation and the “Talker” produces natural speech responses in a streaming manner. MiniCPM-o 2.6, an 8B-parameter multimodal model, is capable of understanding and generating content across vision, speech, and language modalities. Janus-Pro-7B, introduced by DeepSeek AI, is a unified multimodal model that excels at both understanding and generating content across modalities. It features a decoupled visual encoding architecture, separating the processes for understanding and generation.

We expect an uptick in the number of such models in the coming years. It is a well-known intuition that multimodal learning leads to deeper, richer representations. We have curated some any-to-any models and demos in this collection.



Reasoning Models

Reasoning models are models that can solve complex problems. We saw them first as large language models, and now vision language models have them too. Until 2025, there was only one open-source multimodal reasoning model, QVQ-72B-preview by Qwen. It was an experimental model developed by the Alibaba Qwen team and came with many disclaimers.

This year there is another player, Kimi-VL-A3B-Thinking by the Moonshot AI team. It consists of MoonViT (SigLIP-so-400M) as the image encoder and a Mixture-of-Experts (MoE) decoder with 16B total parameters and only 2.8B active parameters. The model is a long chain-of-thought fine-tuned and further aligned (via reinforcement learning) version of the Kimi-VL base vision language model. You can try the model here.

The authors also released an instruction fine-tuned version called Kimi-VL-A3B-Instruct.

kimi-vl
The model can take in long videos, PDFs, screenshots, and more. It has agentic capabilities as well.



Smol yet Capable Models

The community used to scale intelligence through the number of parameters, and then through high-quality synthetic data. After a certain point, benchmarks saturated and scaling models had diminishing returns. The community then turned to shrinking larger models through various methods, like distillation. This makes sense because it reduces compute costs, simplifies deployment, and unlocks use cases like local execution, enhancing data privacy.

When we say small vision language models, we often refer to models with fewer than 2B parameters that can be run on consumer GPUs. SmolVLM is a good example of a family of smaller vision language models. Instead of shrinking larger models, the authors went all the way and tried to fit models into tiny parameter counts like 256M, 500M, and 2.2B. SmolVLM2, for instance, attempted to solve video understanding at these sizes and found 500M to be a good trade-off. At Hugging Face, we have built an iPhone application, HuggingSnap, to demonstrate that these model sizes can achieve video understanding on consumer devices.

Another striking model is gemma3-4b-it by Google DeepMind. It's particularly exciting because it's one of the smallest multimodal models with a 128k-token context window, and it supports 140+ languages. The model is part of the Gemma 3 family, whose largest model ranked first on Chatbot Arena at the time. That largest model was later distilled into a 1B variant.

Lastly, although not the smallest, Qwen2.5-VL-3B-Instruct is worth noting. The model can do various tasks ranging from localization (object detection and pointing) to document understanding to agentic tasks, with a context length of up to 32k tokens.
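If you prefer to poke at these small models from Python, a quick option is the transformers image-text-to-text pipeline. This is a sketch: the checkpoint id, image URL, and pipeline task name reflect our understanding at the time of writing; swap in whichever small VLM you want to try.

from transformers import pipeline

# Assumed checkpoint id; replace with any small VLM supported by the image-text-to-text pipeline.
pipe = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM-500M-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=64))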

You can also use small models through the MLX and llama.cpp integrations. For MLX, assuming you have it installed, you can get started with SmolVLM-500M-Instruct with this one-liner:

python3 -m mlx_vlm.generate --model HuggingFaceTB/SmolVLM-500M-Instruct --max-tokens 400 --temp 0.0 --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg --prompt "What is in this image?"

You can get started with the gemma-3-4b-it model in GGUF format with llama.cpp through the CLI with this one-liner:

llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF  

You can also serve the same model as follows:

llama-server -hf ggml-org/gemma-3-4b-it-GGUF  

We would like to give a shoutout to moondream2 and Florence-2, as they were the earliest attempts at small vision language models. In this blog, we primarily cover newer models (mostly models that came out after April 2024).



Mixture-of-Experts as Decoders

Mixture-of-Experts (MoE) models offer an alternative to dense architectures by dynamically selecting and activating only the most relevant sub-models, termed “experts”, to process a given input segment. This selective activation mechanism (handled by a router) has demonstrated the potential to substantially enhance model performance and operational efficiency while using fewer computational resources.

MoEs are faster at inference than dense counterparts with a similar parameter count, thanks to the selective activation of a smaller slice of the network. They also converge faster during training. But every good thing comes with a price: MoEs have a higher memory cost, since the full model has to sit on the GPU even though only a small chunk of it is used at a time.

In the widely adopted Transformer architecture, MoE layers are mostly integrated by replacing the standard feed-forward network (FFN) layers inside each Transformer block. Dense networks use the full model to run inference, while similarly sized MoE networks selectively activate only some experts. This yields better compute utilization and faster inference.
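As a concrete, deliberately simplified illustration of what replaces the FFN, here is a minimal top-k routed MoE layer in PyTorch. It is a toy sketch only: real implementations add load-balancing losses, expert capacity limits, and fused kernels.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (illustrative only)."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.router(x)                 # (batch, seq, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # mixing weights over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out                              # only top_k experts ran per token

layer = ToyMoELayer()
print(layer(torch.randn(2, 16, 512)).shape)     # torch.Size([2, 16, 512])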

Vision language models that have mixture-of-experts decoders appear to have enhanced performance. For example, Kimi-VL is, as of now, the most advanced open reasoning model, and it has a mixture-of-experts decoder. Mixture-of-Experts models also show promising results with MoE-LLaVA's focus on efficiency and hallucination reduction and DeepSeek-VL2's broad multimodal capabilities. The latest version of Llama (Llama 4) is an MoE with vision capabilities. MoE decoders are a promising research area, and we expect an increase in models like these.

To get a good understanding of MoEs, we recommend reading this excellent article.



Vision-Language-Action Models

VLMs are even making their mark in the field of robotics! There, they're known as vision-language-action models (VLAs). But don't be fooled: those are basically VLMs with a little moustache and hat. VLAs take images and text instructions and return text indicating actions for the robot to take directly. VLAs extend vision language models by adding action and state tokens to interact with and control physical environments. These extra tokens represent the system's internal state (how it perceives the environment), actions (what it does based on commands), and time-related information (like the order of steps in a task). These tokens are appended to the vision-language input to generate actions or a policy.
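The “action token” idea is easy to demystify with a toy sketch: continuous joint commands are discretized into a small vocabulary of bins, appended to the text vocabulary, and de-tokenized back into continuous actions at execution time. This mirrors the general recipe, not any particular model's tokenizer; the bin count and action range below are assumptions.

import numpy as np

NUM_BINS = 256                        # one "action token" id per bin, appended to the text vocabulary
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized joint-command range

def actions_to_tokens(actions: np.ndarray) -> np.ndarray:
    """Discretize continuous actions (e.g. 7-DoF joint deltas) into bin indices."""
    clipped = np.clip(actions, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)    # -> [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Map predicted bin indices back to continuous commands (bin centers)."""
    return ACTION_LOW + (tokens + 0.5) / NUM_BINS * (ACTION_HIGH - ACTION_LOW)

joint_deltas = np.array([0.12, -0.53, 0.0, 0.98, -1.4, 0.3, 0.71])
tokens = actions_to_tokens(joint_deltas)      # what a VLA decoder is trained to emit
print(tokens)
print(tokens_to_actions(tokens))              # what the robot controller would execute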

VLAs are usually fine-tuned on top of a base VLM. Some people extend this definition further and define VLAs as any model that interacts visually with a real or digital world. Under this definition, VLAs can do UI navigation or be used in agentic workflows. But many people believe those applications fall within the VLM domain.

Great examples of VLAs are π0 and π0-FAST, the first robotics foundation models by Physical Intelligence, ported to Hugging Face's LeRobot library. These models are trained across 7 robotics platforms and 68 unique tasks. They show strong zero-shot and fine-tuned performance on complex, real-world activities such as laundry folding, table bussing, grocery bagging, box assembly, and object retrieval.

GR00T N1 is NVIDIA's open VLA foundation model for generalist humanoid robots. It understands images and language and turns them into actions, like moving its arms or following instructions, thanks to a system that combines deliberate reasoning with real-time movement control. GR00T N1 also builds on the LeRobot dataset format, the open standard created to simplify sharing and training on robot demonstrations.

pi0

Taken from the paper

Now that we've looked at the latest VLM model innovations, let's explore how more established capabilities have evolved.



Specialized Capabilities



Object Detection, Segmentation, Counting with Vision Language Models

As we've seen in earlier sections, VLMs enable generalization over traditional computer vision tasks. Models can now take in images and a variety of prompts, such as open-ended text, and output structured text with localization tokens (for detection, segmentation, and more).

Last year, PaliGemma was the first model to attempt solving these tasks. The model takes in an image and text, where the text is a description of an object of interest, along with a task prefix. The text prompt looks like “segment striped cat” or “detect bird on the roof”.

For detection, the model outputs bounding box coordinates as tokens. For segmentation, on the other hand, the model outputs detection tokens and segmentation tokens. These segmentation tokens are not the segmented pixel coordinates themselves, but codebook indices that are decoded by a variational autoencoder trained to turn them into valid segmentation masks (as shown in the figure below).

PaliGemma3
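As a rough sketch of what this looks like in practice with transformers: the checkpoint id and the exact pre/post-processing below are assumptions on our part; see the PaliGemma model card for the canonical recipe.

from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-mix-448"   # assumed checkpoint; any detection-capable PaliGemma mix checkpoint should work
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("roof.jpg")              # your own image
prompt = "detect bird on the roof"          # task prefix + open-ended object description
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=64)
# The generated suffix contains localization tokens, e.g. "<loc0123><loc0456><loc0789><loc1011> bird",
# which encode normalized y_min, x_min, y_max, x_max box coordinates.
print(processor.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))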

Many models have been introduced for localization tasks after PaliGemma. Late last year, an upgraded version of PaliGemma, PaliGemma 2, appeared with the same capabilities and better performance. Another model that came later was Molmo by Allen AI, which can point to instances with dots and count object instances.

molmo

Qwen2.5-VL can also detect, point to, and count objects, and this includes UI elements as objects too!

Qwen2.5VL



Multimodal Safety Models

Vision language models in production require filtering of inputs and outputs to prevent jailbreaks and harmful outputs for compliance. Harmful content ranges from violent inputs to sexually explicit content. That's where multimodal safety models come in: they're used before and after vision language models to filter their inputs and outputs. They're just like LLM safety models, but with additional image input.

In early 2025, Google introduced the first open multimodal safety model, ShieldGemma 2. It's built on ShieldGemma, the text-only safety model. The model takes in images and content policies and returns whether an image is safe with respect to a given policy, where a policy is a criterion describing when an image is inappropriate. ShieldGemma 2 can also be used to filter the outputs of image generation models.

Llama Guard 4 by Meta is a dense multimodal and multilingual safety model. It is densely pruned from Llama 4 Scout (a multimodal mixture-of-experts model) and fine-tuned for safety.

Llama Guard 4

The model can be used for text-only and multimodal inference. It can also take in vision language model outputs and complete conversations, and filter them before they are sent to the user.
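The wiring is simple in spirit. Here is a purely hypothetical sketch of the before/after filtering flow: classify_image, classify_text, and vlm_generate are stand-ins for whichever safety model (e.g. ShieldGemma 2 or Llama Guard 4) and VLM you deploy, not real library APIs.

def moderated_vlm_call(image, user_prompt, classify_image, classify_text, vlm_generate,
                       policy="No violent or sexually explicit content."):
    """Run a VLM request with a safety model screening both the input and the output.
    `classify_image`, `classify_text`, and `vlm_generate` are hypothetical callables
    wrapping your safety model and your VLM."""
    # 1. Screen the incoming image against the content policy before the VLM sees it.
    if not classify_image(image, policy).is_safe:
        return "Sorry, this image violates our content policy."
    # 2. Run the actual vision language model.
    answer = vlm_generate(image=image, prompt=user_prompt)
    # 3. Screen the model's answer (and the conversation) before it is shown to the user.
    if not classify_text(conversation=[user_prompt, answer], policy=policy).is_safe:
        return "Sorry, I can't share that response."
    return answer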



Multimodal RAG: retrievers, rerankers

Now let's look at how Retrieval Augmented Generation has evolved in the multimodal space. RAG for complex documents, usually formatted as PDFs, is traditionally handled in three steps:

  1. parsing the document completely into text
  2. passing the plain text and the query to a retriever and a reranker to get the most relevant documents
  3. passing the relevant context and the query to an LLM

A traditional PDF parser consists of multiple components that preserve the structure and visual elements in the document, such as layout, tables, images, and charts, all rendered into markdown. But this setup can be hard to maintain.
Traditional Parsing

With the rise of vision language models, this issue has been addressed: there are now multimodal retrievers and rerankers.

Multimodal RAG

Multimodal retrievers take a stack of PDFs and a query as input and return the most relevant page numbers along with their confidence scores. The scores represent how likely it is that a page contains the answer to the query, or how relevant the query is to the page. This bypasses the brittle parsing step.

The most relevant pages are then fed to the vision language model along with the query, and the VLM generates the answer.

There are two main multimodal retriever architectures:

  1. Document Screenshot Embedding (DSE, MCDSE)
  2. ColBERT-like models (ColPali, ColQwen2, ColSmolVLM)

DSE models consist of a text encoder and an image encoder, returning a single vector per query and a single vector per passage. The returned scores are a softmax over the dot products of the embeddings.

DSE

Taken from the paper

ColBERT-like models, like ColPali, are also dual-encoder models, with a twist: ColPali uses a vision language model as the image encoder and a large language model as the text encoder. These models are not strictly encoders, but they output embeddings that are then passed to a “MaxSim” operation. Unlike DSE, the output is multiple vectors, one per token. In MaxSim, the similarity between each text token embedding and each image patch embedding is calculated, which captures nuances better. Because of this, ColBERT-like models are less cost-efficient but offer better performance.
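To make the difference concrete, here is the late-interaction (MaxSim) score written out in a few lines of PyTorch, next to the single-vector dot product a DSE-style model would use. The shapes and dimensions are illustrative only.

import torch

def maxsim_score(query_tokens: torch.Tensor, page_patches: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for every query token, take its best-matching
    page patch, then sum over query tokens.
    query_tokens: (num_query_tokens, dim), page_patches: (num_patches, dim)."""
    sim = query_tokens @ page_patches.T          # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()           # max over patches, sum over query tokens

def single_vector_score(query_vec: torch.Tensor, page_vec: torch.Tensor) -> torch.Tensor:
    """DSE-style scoring: one embedding per query and per page, a single dot product."""
    return query_vec @ page_vec

query_tokens = torch.randn(12, 128)              # e.g. 12 query token embeddings
page_patches = torch.randn(1030, 128)            # e.g. 1030 image patch embeddings for a page
print(maxsim_score(query_tokens, page_patches))
print(single_vector_score(query_tokens.mean(0), page_patches.mean(0)))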

Below you can see the indexing latency for ColPali. Since it's only a single model, it's also easier to maintain.

ColPali
Taken from the paper

On the Hugging Face Hub, you can find these models under the task “Visual Document Retrieval”.

The most popular benchmark for this task is ViDoRe, which consists of documents in English and French, ranging from financial reports and scientific figures to administrative documents. Each example in ViDoRe has the document as an image, a query, and potential answers. The documents matched with the queries enable contrastive pre-training, so the ViDoRe train set is used to train new models.



Multimodal Agents

Vision language models unlock many agentic workflows, from chatting with documents to computer use. Here we'll cover the latter, as it requires more advanced agentic capabilities. Recently, there have been many vision language model releases that understand and operate over UIs. The latest one is UI-TARS-1.5 by ByteDance, which showed great results in operating over browser, computer, and phone use. It can also do gameplay with reasoning and operate in open-world games. Another impactful release of this year is MAGMA-8B, a foundation model for both UI navigation and physical interaction with the real world. Furthermore, Qwen2.5-VL (especially its 32B variant, as it is further trained on agentic tasks) and the Kimi-VL reasoning model are good at GUI agentic tasks.

At the beginning of 2025, we introduced smolagents, a new lightweight agentic library that implements the ReAct framework. Shortly after, we added vision language support to the library. This integration covers two use cases:

  • Providing images once at the start of the run. This is useful for document AI with tool use.
  • Retrieving images dynamically. This is useful for cases such as GUI control with VLM agents, where the agent repeatedly takes screenshots.

The library provides building blocks for users to build their own agentic workflows with image understanding. We offer different scripts and single-line CLI commands to get users started easily.

For the first case, assume we want an agent to describe documents (which isn't very agentic, but good for minimal use cases). You can initialize the CodeAgent (an agent that writes its own code!) like the following:

from smolagents import CodeAgent  # `model` is any smolagents model instance; document_1..3 are PIL images

agent = CodeAgent(tools=[], model=model)
agent.run("Describe these documents:", images=[document_1, document_2, document_3])

For the latter use case, where we need an agent to take screenshots, we can define a callback to be executed at the end of each ActionStep. For your own use case where you need to retrieve images dynamically, modify the callback however you'd like. We won't define everything in full detail here for simplicity; you can optionally read the blog post and the script linked at the end of this blog post. For now, let's see how we initialize the agent with callbacks and browser control steps.

def save_screenshot(memory_step: ActionStep, agent: CodeAgent) -> None:
    """Takes a screenshot and writes it to the step's observations."""
    png_bytes = driver.get_screenshot_as_png()      # `driver` is the helium/selenium WebDriver
    image = Image.open(BytesIO(png_bytes))          # assumes PIL.Image and io.BytesIO are imported
    memory_step.observations_images = [image.copy()]
    url_info = f"Current url: {driver.current_url}"
    memory_step.observations = (
        url_info if memory_step.observations is None else memory_step.observations + "\n" + url_info
    )
    return

agent = CodeAgent(
    tools=[go_back, close_popups, search_item_ctrl_f], 
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot], 
)

You can simply try the whole example by running the following CLI command. It starts an agent with control over a web browser, powered by a vision language model, to perform web automation tasks (please replace the URL with the website you'd like it to navigate to).

webagent "go to xyz.com/men, get to sale section, click the primary clothing item you see. Get the product details, and the worth, return them. note that I'm shopping from France"   

smolagents supports different model types, such as local transformers models, open-source models served via Inference Providers, or endpoints from closed-source model providers. We encourage the use of open-source models, as many agentic workflows currently require reasoning, which benefits from models with a large number of parameters. As of April 2025, Qwen 2.5 VL is a good candidate for agentic workflows, since the model is further trained on agentic tasks.



Video Language Models

Most vision language models these days can handle videos, because videos can be represented as a sequence of frames. However, video understanding is tricky due to the temporal relationship between frames and the sheer number of frames, so different techniques are used to select a representative set of video frames.
Video LMs

Since last year, the community has weighed different approaches and tricks to solve this problem.

A good example is the LongVU model by Meta. It downsamples video frames by passing them through DINOv2 and removing the most similar ones, and then further refines the selection by picking the frames most relevant to the text query, where both the text and the frames are projected into the same space and their similarity is calculated. Qwen2.5-VL can handle long context and is adapted to dynamic FPS rates, as the model is trained on videos with different frame rates. Through extended multimodal RoPE, it understands the absolute time positions of frames, so it can handle different rates and still understand the speed at which events happen in real life. Another model is Gemma 3, which can accept video frames interleaved with timestamps in the text prompt, e.g. “Frame 00.00: ..”, and is very performant on video understanding tasks.

MRoPE
Taken from the paper
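A toy version of the “drop near-duplicate frames” trick is easy to write down. The sketch below is illustrative, not LongVU's actual pipeline: it keeps a frame only when its embedding (from any image encoder; DINOv2 in LongVU's case) is sufficiently different from the last kept frame, and the 0.9 threshold is an arbitrary assumption.

import torch

def select_keyframes(frame_embeddings: torch.Tensor, threshold: float = 0.9) -> list[int]:
    """Keep a frame only if its cosine similarity to the last kept frame is below `threshold`.
    frame_embeddings: (num_frames, dim), e.g. one image-encoder embedding per sampled frame."""
    normed = torch.nn.functional.normalize(frame_embeddings, dim=-1)
    kept = [0]                                   # always keep the first frame
    for i in range(1, normed.shape[0]):
        if (normed[i] @ normed[kept[-1]]) < threshold:
            kept.append(i)                       # frame i differs enough from the last kept one
    return kept

embeddings = torch.randn(64, 768)                # embeddings for 64 uniformly sampled frames
print(select_keyframes(embeddings, threshold=0.9))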



Recent Alignment Techniques for Vision Language Models

Preference optimization is an alternative fine-tuning approach for language models that can be extended to vision language models. Instead of relying on fixed labels, this method focuses on comparing and ranking candidate responses based on preferences. The trl library offers support for direct preference optimization (DPO), including for VLMs.

Below is an example of how a preference dataset for DPO fine-tuning of a VLM is structured. Each entry consists of an image + question pair and two corresponding answers: one chosen and one rejected. The VLM is fine-tuned to generate responses aligned with the preferred (chosen) answer.

DPO

An example dataset for this procedure is RLAIF-V, which contains over 83,000 annotated samples formatted according to the structure described above. Each entry includes a list of images (usually one), a prompt, a chosen answer, and a rejected answer, just as expected by the DPOTrainer. There is an RLAIF-V formatted dataset that is already formatted accordingly. Below is an example of a single sample:

{'images': [<PIL.Image.Image image mode=RGB size=980x812 at 0x154505570>],
 'prompt': [ { "content": [ { "text": null, "type": "image" }, { "text": "What should this catcher be using?", "type": "text" } ], "role": "user" } ],
 'rejected': [ { "content": [ { "text": "The catcher, identified by the number...", "type": "text" } ], "role": "assistant" } ],
 'chosen': [ { "content": [ { "text": "The catcher in the image should be using a baseball glove...", "type": "text" } ], "role": "assistant" } ]}

Once the dataset is ready, you can use the DPOConfig and DPOTrainer classes from the trl library to configure and launch the fine-tuning process.
Below is an example configuration using DPOConfig:

from trl import DPOConfig

training_args = DPOConfig(
    output_dir="smolvlm-instruct-trl-dpo-rlaif-v",
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=5,
    dataset_num_proc=8,  
    dataloader_num_workers=8,  
    logging_steps=10,
    report_to="tensorboard",
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    eval_steps=10,  
    eval_strategy="steps",
)

To train your model using the DPOTrainer, you can optionally provide a reference model to compute the reward difference. If you're using Parameter-Efficient Fine-Tuning (PEFT), you may omit the reference model by setting ref_model=None.

from trl import DPOTrainer

trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=peft_config,
    tokenizer=processor
)

trainer.train()



Recent benchmarks

Benchmarks have also evolved significantly over the past year. In our previous blog post, we described MMMU and MMBench as two emerging benchmarks for evaluating vision language models. With the rapid progress in the field, models have saturated these benchmarks, and we need better evaluation tools. Beyond general-purpose benchmarks, we also need tools that assess specific capabilities.

MMT-Bench

Now, we highlight two general-purpose benchmarks that stand out: MMT-Bench and MMMU-Pro.



MMT-Bench

MMT-Bench is designed to evaluate VLMs across a wide range of multimodal tasks that require expert knowledge as well as precise visual recognition, localization, reasoning, and planning. The benchmark includes 31,325 multiple-choice visual questions covering various multimodal scenarios, with image, text, video, and point cloud modalities. It spans 32 different meta-tasks and 162 subtasks, covering a variety of tasks including OCR, visual recognition, and visual-language retrieval.



MMMU-Pro

MMMU-Pro is an improved version of the original MMMU benchmark. It also evaluates advanced AI models' true understanding capabilities across multiple modalities.
It is more complex than MMMU: for example, it has a vision-only input setting and increases the number of candidate options from 4 to 10. The benchmark also incorporates real-world simulation, with vision-only questions derived from screenshots or photos captured within a simulated display, featuring various backgrounds, font styles, and sizes to mimic real-world conditions.


Extra: Our model picks

Here are our picks for some highlighted models. There are many models that we like; the ones below are the latest.

| Model Name | Sizes | Why we love it |
| --- | --- | --- |
| Qwen2.5-VL | 3B to 72B | Great versatile model with agentic capabilities, math, and more |
| RolmOCR | 7B | Very performant OCR model |
| Kimi-VL-Thinking | 16B MoE with 3B active parameters | Best reasoning model |
| SmolVLM2 | 256M, 500M (our favourite!), 2.2B | Smallest video language model |
| Llama 4 Scout & Maverick | 109B/400B MoE with 17B active parameters | Loooooong context |
| Molmo | 1B, 7B, 72B, and an MoE with 1B active parameters | Fully open model with localization capabilities on top |

Aaaaand that's it! We hope you found this blog post useful for catching up on everything that happened this past year. We're looking forward to seeing all the things you'll build with the models in this blog. 🤗 Below we provide some links with more in-depth explanations of each topic in this post.

We thank Vaibhav Srivastav and Pablo Montalvo Leroux for their review of this blog.



Useful Resources

Here's a compilation of blog posts where we go through the items in this post in more depth.


