SmolVLM - small yet mighty Vision Language Model



This blog post introduces SmolVLM, a 2B VLM, SOTA for its memory footprint. SmolVLM is small, fast, memory-efficient, and fully open-source. All model checkpoints, VLM datasets, training recipes and tools are released under the Apache 2.0 license.




What’s SmolVLM?

This year has seen a boom in multimodal AI with many large vision language models released. The trends were to initially scale up compute, later scale up the data diversity by generating synthetic data with large models, and, recently, scale down to make these models more efficient. Small open models allow local deployment to browsers or edge devices, cut inference costs, and enable user customization. Some notable examples of these models include PaliGemma 3B, moondream2, and Qwen2VL.

In this blog post, we introduce SmolVLM, a new family of 2B small vision language models that can be used commercially and deployed in smaller local setups, with completely open training pipelines.

We release three models: SmolVLM-Base, which can be used for downstream fine-tuning; SmolVLM-Synthetic, the variant fine-tuned on synthetic data; and SmolVLM-Instruct, the fine-tuned instruction variant, which can be used out of the box for interactive end-user applications.

This release comes with open-source models integrated into transformers, a demo built on SmolVLM-Instruct, and a supervised fine-tuning script. We’ve used the datasets previously used for Idefics3: the Cauldron and Docmatix, which are also fully open-source.






Model capabilities

Each example below pairs an input image with an input prompt and shows SmolVLM’s answer.
Prompt: I’m planning a visit to this temple, give me travel suggestions.

Answer: The Grand Palace is the most famous landmark in Bangkok, and it’s worth visiting for its stunning architecture and history. Here are some of our favorite things about the palace:

The Main Building:
Located at the center of the complex, the main building features intricate tile work with traditional Thai motifs that cover every inch of the structure. This includes patterns such as dragons, mythical creatures, flowers, and geometric shapes. The roofline has an elaborate design featuring multiple layers of tiles adorned with gold accents.

Towers & Spires:
There are several towers within the Grand Palace, including the Phra Nakhon Si Ayutthaya National Museum Tower, which houses exhibits on Thai culture and heritage. Another notable tower is the Wat Yai Chai Mongkol Tower, where you’ll find beautiful murals depicting scenes from Buddhist mythology.

Architecture Style:
The architectural style combines elements of Hindu and Khmer styles commonly seen throughout Southeast Asia. You will notice influences like stupas (dome-shaped structures), lotus flower carvings, and ornate details typical of these regions’ religious buildings.

Visitor Experience:
While exploring the Grand Palace, visitors should be aware of certain rules due to security measures taken by the authorities. These include no photography inside certain areas or using flashbulbs outside. Additionally, there may be queues during peak tourist seasons, so plan accordingly if you’re traveling here.



Travel Suggestions For Visiting The Grand Palace:

  • Best Time To Visit: The best time to visit would likely be early morning before sunrise, when temperatures drop significantly, making walking around easier.
  • Transportation: Public transportation options exist, but they may not provide direct access to the Grand Palace itself; consider taking a taxi instead.
  • Food Options: There aren’t many dining facilities available near the Grand Palace, but local street food vendors do offer delicious meals nearby.
  • Clothing Guidelines: While not mandatory, wearing traditional Thai clothing (sarongs) could add to the cultural immersion experience. However, shorts/trousers are allowed provided they do not go below knee level.
  • Safety Measures: Visitors must follow signs indicating restricted zones marked “No Photography” among others. Always carry identification documents just in case.

By following these guidelines, your trip will undoubtedly enhance both your understanding of Thailand’s rich historical tapestry and your personal enjoyment!

Prompt: Where do the severe droughts occur according to this diagram?
Answer: The severe droughts are happening in eastern and southern Africa.
Prompt: What is the due date and the invoice date?
Answer: The due date is 26/02/2019, and the invoice date is 11/02/2019.



Architecture


For SmolVLM, we closely followed the architecture from Idefics3, to the point that we use the same implementation in transformers. There are, however, a few key differences:

  • We replaced Llama 3.1 8B with SmolLM2 1.7B as the language backbone.
  • We more aggressively compress the patched visual information, reducing it 9x using the pixel shuffle strategy, compared to 4x with Idefics3 (see the sketch after this list).
  • We use patches of 384×384 pixels, instead of 364×364, because 384 is divisible by 3, which is necessary for our pixel shuffle strategy to work.
  • For this, we change the vision backbone to use a shape-optimized SigLIP with patches of 384×384 pixels and inner patches of 14×14.
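
To make the compression concrete, below is a minimal sketch of a ratio-3 pixel shuffle on a grid of visual tokens; the function name and tensor layout are illustrative, not the exact transformers implementation. A 384×384 patch with 14×14 inner patches yields a 27×27 grid of tokens, and folding each 3×3 neighborhood into the channel dimension leaves 9×9 = 81 tokens:

import torch

def pixel_shuffle(x, ratio=3):
    # x: (batch, seq, dim) visual tokens arranged on a square grid
    b, seq, dim = x.shape
    side = int(seq ** 0.5)  # 27 for a 384x384 image with 14x14 patches
    x = x.view(b, side, side, dim)
    x = x.view(b, side, side // ratio, dim * ratio)        # fold columns into channels
    x = x.permute(0, 2, 1, 3)
    x = x.reshape(b, side // ratio, side // ratio, dim * ratio * ratio)  # fold rows
    return x.reshape(b, (side // ratio) ** 2, dim * ratio * ratio)

tokens = torch.randn(1, 27 * 27, 1152)  # 729 SigLIP-sized tokens
print(pixel_shuffle(tokens).shape)      # torch.Size([1, 81, 10368]) -> 9x fewer tokens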



Performance



Benchmarks

We present benchmarks for the tasks we mention in the training details.

| Model | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
|---|---|---|---|---|---|---|
| SmolVLM | 38.8 | 44.6 | 42.1 | 81.6 | 72.7 | 5.02 |
| Qwen2-VL 2B | 41.1 | 47.8 | 47.5 | 90.1 | 79.7 | 13.70 |
| InternVL2 2B | 34.3 | 46.3 | 49.8 | 86.9 | 73.4 | 10.52 |
| PaliGemma 3B 448px | 34.9 | 28.7 | 48.3 | 32.2 | 56.0 | 6.72 |
| moondream2 | 32.4 | 24.3 | 40.3 | 70.5 | 65.2 | 3.87 |
| MiniCPM-V-2 | 38.2 | 39.8 | 39.1 | 71.9 | 74.1 | 7.88 |
| MM1.5 1B | 35.8 | 37.2 | 0.0 | 81.0 | 72.5 | NaN |



Memory

Inference GPU memory use for SmolVLM and other models

SmolVLM provides the best memory usage among the existing suite of vision language models in transformers. This allows it to run efficiently on-device, such as on a laptop! You can see above the GPU memory usage in GB for each model, running inference with one or two input images, and using the same images and text prompts in all tests. SmolVLM’s efficiency in image encoding is built into the model. SmolVLM encodes each 384×384 image patch to 81 tokens. This results in SmolVLM encoding our test prompt and a single image in 1.2k tokens, whereas Qwen2-VL uses 16k tokens. This also explains why the memory consumption increases so much for two images with Qwen and InternVL. In contrast, the increase is much more moderate for SmolVLM and PaliGemma, which use a similar approach.
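
As a quick sanity check of these token counts, you can measure how long a one-image prompt is with the processor (this assumes the SmolVLM-Instruct loading shown later in this post; the exact count depends on the image resolution and prompt):

from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
image = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/resolve/main/example_images/rococo.jpg")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
print(inputs["input_ids"].shape[-1])  # total prompt length in tokens, image tokens included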



Throughput


SmolVLM’s tiny memory footprint also means that it requires far fewer computations to prefill the model and generate. Compared to Qwen2-VL, the prefill throughput is 3.3 to 4.5 times faster, and the generation throughput is 7.5 to 16 times faster.
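
If you want a rough end-to-end number on your own hardware, a simple timing loop like the following works (this is a sketch reusing the model and inputs from the transformers snippet later in this post, and it does not separate prefill from decode the way our benchmark does):

import time
import torch

# `model` and `inputs` as prepared in the transformers snippet below
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"end-to-end generation: {new_tokens / elapsed:.1f} tokens/s")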



Video

Given SmolVLM’s long context and the possibility of tweaking the internal frame resizing of the model, we explored its suitability as an accessible option for basic video analysis tasks, particularly when computational resources are limited.

In our evaluation of SmolVLM’s video understanding capabilities, we implemented a straightforward video processing pipeline, extracting up to 50 evenly sampled frames from each video while avoiding internal frame resizing.
This simple approach yielded surprisingly competitive results on the CinePile benchmark, with a score of 27.14%, a performance that positions the model between InternVL2 (2B) and Video-LLaVA (7B).
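
The frame-sampling step can be as simple as the following sketch (illustrative, with OpenCV assumed; not the exact pipeline used for the benchmark):

import cv2

def sample_frames(video_path, max_frames=50):
    # Return up to max_frames RGB frames, evenly spaced across the video
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    n = min(max_frames, total)
    indices = [round(i * (total - 1) / max(n - 1, 1)) for i in range(n)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB
    cap.release()
    return frames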

The quantitative results align with our qualitative testing. Here is an example from the FineVideo dataset:

Question: what is shown in the close-up?
Assistant: In the close-up, there are bottles of Heinz tomato ketchup,
  honey, soy sauce, and garlic on a kitchen counter.

Question: describe what is happening in the video.
Assistant: The video shows a man cooking in a kitchen.
  He is wearing a black shirt and a black apron.
  He is standing in front of a counter and is holding a bottle of Heinz tomato ketchup,
  a bottle of honey, soy sauce, and a bowl of minced garlic

While in the second question we see some temporal understanding limitations (the cook points to one ingredient after the other rather than pointing to/holding all of them at the same time), SmolVLM demonstrated great scene understanding and object recognition capabilities.



VLMEvalKit integration

We integrated SmolVLM with VLMEvalKit to facilitate easy evaluation across additional benchmarks.

By running the following command, you can evaluate SmolVLM or your fine-tuned SmolVLM model.

python run.py --data <benchmarks> --model SmolVLM --work-dir <output_directory>

For example, to evaluate on the MMMU dev validation set and MathVista mini, and store the results in a folder called smol:

python run.py --data MMMU_DEV_VAL MathVista_MINI --model SmolVLM --work-dir smol



Use SmolVLM with Transformers

You can easily load SmolVLM using the Auto classes in transformers. Under the hood, the model and processor are mapped to the same implementations used for Idefics3.

from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and the model; use Flash Attention 2 on GPU, eager attention on CPU
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

Images and text can be interleaved arbitrarily, and you can pass in multiple images. Here’s how you can use the chat template and pass the formatted input to the processor.

from PIL import Image
from transformers.image_utils import load_image

image1 = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/resolve/main/example_images/rococo.jpg")
image2 = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/resolve/main/example_images/rococo_1.jpg")


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]


prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

Generate with the preprocessed input, then decode the generated output.


generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])



Training Details



Dataset

First, we had to train SmolLM2 to extend its context, but we will discuss that in the next subsection. Once we had a long-context SmolLM2, we trained SmolVLM using the same data that we used for Idefics3. Mainly, we used The Cauldron and Docmatix. The full list of datasets we used can be consulted here.




Context extension


SmolLM2’s pre-training context window is insufficient for VLMs. Images are encoded into many tokens, and we wanted to support multiple images. To address this, we extended it to 16k tokens by increasing the RoPE base value from 10k to 273k, following the guidelines in “Scaling Laws of RoPE-based Extrapolation”. We fine-tuned the model on a mixture of long- and short-context datasets.
For long-context datasets, we used the “books” subset of Dolma (primarily Project Gutenberg) and code documents with 8k+ tokens from The Stack, each contributing 20% to the final mixture. For short-context datasets, we streamlined the original SmolLM2 pre-training mix to include 20% FineWeb-Edu, 20% DCLM, and 20% from our math dataset (to be released soon). The math dataset was upsampled to mitigate a performance drop observed on GSM8k during the context extension process.
All experiments were implemented using the EasyContext repository.
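
For reference, here is a minimal sketch of what raising the RoPE base looks like with a transformers config; the snippet is illustrative, not our actual training code:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
config.rope_theta = 273_000              # RoPE base raised from the 10k used in pre-training
config.max_position_embeddings = 16_384  # target 16k-token context
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B", config=config)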



Checkpoint Selection

For our training run, we saved checkpoints every 25 optimization steps, allowing us to evaluate and potentially recover the model’s state at different points in training. This practice is crucial for identifying the optimal model version, as training longer doesn’t always guarantee better performance.
We evaluated performance across multiple vision-language benchmarks, each weighted according to its importance. The core benchmarks included the following:

  • General multimodal understanding (MMMU and MMStar), which are the most comprehensive benchmarks
  • Document and text-based visual query answering (DocVQA and TextVQA)
  • Mathematical Reasoning (MathVista)
  • Diagram understanding (AI2D)

To select the optimal checkpoint, we created a single metric by combining these benchmarks with different manually assigned weights to reflect their relative importance in assessing the model’s capabilities. We used this single metric to pick the best checkpoint. Generally, the models tended to do well on most benchmarks with more training, but their relative performance on DocVQA would decrease considerably.
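
For illustration, checkpoint selection with such a metric could look like the sketch below; the weights and scores are made-up placeholders, since the actual manually assigned weights were not published:

# Made-up weights and scores, purely to illustrate the selection mechanism
WEIGHTS = {"MMMU": 0.20, "MMStar": 0.20, "DocVQA": 0.20,
           "TextVQA": 0.15, "MathVista": 0.15, "AI2D": 0.10}

def checkpoint_score(results):
    return sum(WEIGHTS[bench] * score for bench, score in results.items())

results_per_checkpoint = {
    "step-0975": {"MMMU": 38.2, "MMStar": 41.7, "DocVQA": 82.0,
                  "TextVQA": 72.1, "MathVista": 43.9, "AI2D": 63.5},
    "step-1000": {"MMMU": 38.8, "MMStar": 42.1, "DocVQA": 81.6,
                  "TextVQA": 72.7, "MathVista": 44.6, "AI2D": 64.0},
}
best = max(results_per_checkpoint, key=lambda c: checkpoint_score(results_per_checkpoint[c]))
print(best)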



Fine-tuning

You can fine-tune SmolVLM using transformers and apply alignment techniques using TRL 🚀

We provide a notebook to fine-tune it on the VQAv2 dataset, optionally using LoRA, QLoRA or full fine-tuning. In the notebook, you’ll find some tricks to save even more memory and use a larger batch size to fit SmolVLM on consumer GPUs, like the L4, for training. With a batch size of 4, 8-bit loading with QLoRA, and gradient checkpointing, we can fine-tune on an L4 consuming around ~16 GB of VRAM. This makes it possible to fine-tune your SmolVLM using Colab! You can play around with the parameters to find a nice point in the training duration-memory trade-off.
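
As a rough sketch of that setup (the LoRA hyperparameters and target modules below are assumptions; the notebook has the exact recipe):

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 8-bit loading keeps the frozen base weights small in memory
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=bnb_config,
)
model.gradient_checkpointing_enable()  # recompute activations instead of storing them

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()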

SmolVLM also comes with TRL integration so you can apply Direct Preference Optimization (DPO) easily through the CLI. Start by running pip install trl accelerate peft and then run the following command to fine-tune on the RLAIF-V dataset:

accelerate launch \
  --config_file examples/accelerate_configs/multi_gpu.yaml examples/scripts/dpo_vlm.py \
  --dataset_name HuggingFaceH4/rlaif-v_formatted \
  --model_name_or_path HuggingFaceTB/SmolVLM-Instruct \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 32 \
  --dataset_num_proc 32 \
  --output_dir dpo_smolvlm_rlaif-v \
  --bf16 --torch_dtype bfloat16 \
  --use_peft --lora_target_modules=all-linear

The resulting LoRA adapter weights are SmolVLM-Instruct-DPO. A detailed tutorial on preference tuning vision-based LLMs can be found here: dpo_vlm.



Citation information

You’ll be able to cite us in the next way:

@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models}, 
  author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}



Wrapping Up

We introduced SmolVLM, a fully open, small, and mighty VLM for the community! We also provide tools for the community to use and customize it. We’re looking forward to seeing what you’ll create with SmolVLM.

Below are some resources if you would like to read more about all things related to SmolVLM.


