This blog post introduces SmolVLM, a 2B VLM that is state of the art for its memory footprint. SmolVLM is small, fast, memory-efficient, and fully open-source. All model checkpoints, VLM datasets, training recipes and tools are released under the Apache 2.0 license.

What is SmolVLM?
This year has seen a boom in multimodal AI with many large vision language models released. The trend was first to scale up compute, then to scale up data diversity by generating synthetic data with large models, and, more recently, to scale down to make these models more efficient. Small open models allow local deployment to browsers or edge devices, cut inference costs, and enable user customization. Some notable examples of these models include PaliGemma 3B, moondream2, and Qwen2VL.
In this blog post, we introduce SmolVLM, a new family of 2B small vision language models that can be used commercially and deployed to smaller local setups, with completely open training pipelines.
We release three models: SmolVLM-Base, which can be used for downstream fine-tuning; SmolVLM-Synthetic, a variant fine-tuned on synthetic data; and SmolVLM-Instruct, the instruction fine-tuned variant, which can be used out of the box for interactive end-user applications.
This release comes with open-source models integrated into transformers, a demo built on SmolVLM-Instruct, and a supervised fine-tuning script. We used the datasets previously used for Idefics3: the Cauldron and Docmatix, which are also fully open-source.
Model capabilities
Architecture

For SmolVLM, we closely followed the architecture of Idefics3, to the point that we use the same implementation in transformers. There are, however, a few key differences:
- We replaced Llama 3.1 8B with SmolLM2 1.7B as the language backbone.
- We compress the patched visual information more aggressively, reducing it 9x with the pixel shuffle strategy, compared to 4x in Idefics3.
- We use image patches of 384×384 pixels, instead of 364×364, because 384 is divisible by 3, which is important for our pixel shuffle strategy to work.
- For this, we change the vision backbone to use a shape-optimized SigLIP with 384×384 image patches and 14×14 inner patches (see the sketch below).
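To make the 9x reduction concrete, here is a minimal space-to-depth sketch of the pixel shuffle idea (an illustration of the technique, not the exact implementation in transformers): a 384×384 patch encoded by SigLIP with 14×14 inner patches yields a 27×27 grid of visual features, and shuffling with a ratio of 3 folds each 3×3 neighborhood into the channel dimension, leaving 9×9 = 81 tokens. The hidden size below is only for demonstration.

import torch

def pixel_shuffle(x: torch.Tensor, ratio: int = 3) -> torch.Tensor:
    # x: (batch, height, width, channels) grid of visual features from the vision encoder
    b, h, w, c = x.shape
    # Fold each ratio x ratio spatial neighborhood into the channel dimension.
    x = x.view(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h // ratio, w // ratio, c * ratio ** 2)
    return x

features = torch.randn(1, 27, 27, 1152)           # 27x27 = 729 visual tokens for one 384x384 patch
compressed = pixel_shuffle(features, ratio=3)     # shape (1, 9, 9, 1152 * 9)
print(compressed.shape[1] * compressed.shape[2])  # 81 tokens per image patch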
Performance
Benchmarks
We present benchmarks for the tasks we mention in training details.
| Model | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
|---|---|---|---|---|---|---|
| SmolVLM | 38.8 | 44.6 | 42.1 | 81.6 | 72.7 | 5.02 |
| Qwen2-VL 2B | 41.1 | 47.8 | 47.5 | 90.1 | 79.7 | 13.70 |
| InternVL2 2B | 34.3 | 46.3 | 49.8 | 86.9 | 73.4 | 10.52 |
| PaliGemma 3B 448px | 34.9 | 28.7 | 48.3 | 32.2 | 56.0 | 6.72 |
| moondream2 | 32.4 | 24.3 | 40.3 | 70.5 | 65.2 | 3.87 |
| MiniCPM-V-2 | 38.2 | 39.8 | 39.1 | 71.9 | 74.1 | 7.88 |
| MM1.5 1B | 35.8 | 37.2 | 0.0 | 81.0 | 72.5 | NaN |
Memory

SmolVLM provides the best memory usage among the existing suite of vision language models in transformers. This allows it to run efficiently on-device, such as on a laptop! The chart above shows the GPU memory usage in GB for each model, running inference with one or two input images, using the same images and text prompts in all tests. SmolVLM's efficiency in image encoding is built into the model: it encodes each 384×384 image patch into 81 tokens. As a result, SmolVLM encodes our test prompt and a single image in 1.2k tokens, whereas Qwen2-VL uses 16k tokens. This also explains why memory consumption increases so much with two images for Qwen2-VL and InternVL2. In contrast, the increase is much more moderate for SmolVLM and PaliGemma, which use a similar approach.
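If you want to check the token budget for your own prompts, one way (sketched below, reusing the SmolVLM-Instruct processor and one of the example images from this post) is to run the processor and count the resulting input ids; the exact number depends on the image, the prompt, and the processor's resizing settings.

from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
image = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/resolve/main/example_images/rococo.jpg")

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Total prompt length in tokens, image tokens included
print(inputs["input_ids"].shape[-1])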
Throughput

SmolVLM's tiny memory footprint also means that it requires far fewer computations to prefill and generate. Compared to Qwen2-VL, the prefill throughput is 3.3 to 4.5 times faster, and the generation throughput is 7.5 to 16 times faster.
Video
Given SmolVLM's long context and the possibility of tweaking the model's internal frame resizing, we explored its suitability as an accessible option for basic video analysis tasks, particularly when computational resources are limited.
To evaluate SmolVLM's video understanding capabilities, we implemented a straightforward video processing pipeline, extracting up to 50 evenly sampled frames from each video while avoiding internal frame resizing (a sketch of this kind of frame sampling is shown below).
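As a rough illustration of that kind of pipeline (not our exact code), the following sketch uses OpenCV to pull up to 50 evenly spaced frames out of a video; each resulting PIL image can then be passed to the processor as a separate {"type": "image"} entry in the chat messages.

import cv2
import numpy as np
from PIL import Image

def sample_frames(video_path: str, max_frames: int = 50) -> list[Image.Image]:
    # Evenly sample up to `max_frames` frames across the whole video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, min(max_frames, total)).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR arrays; convert to RGB PIL images for the processor.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames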
This simple approach yielded surprisingly competitive results on the CinePile benchmark, with a score of 27.14%, which positions the model between InternVL2 (2B) and Video-LLaVA (7B).
The quantitative results align with our qualitative testing. Here is an example from the FineVideo dataset:
Question: what is shown in the close-up?
Assistant: In the close-up, there are bottles of Heinz tomato ketchup,
honey, soy sauce, and garlic on a kitchen counter.
Question: describe what is happening in the video.
Assistant: The video shows a person cooking in a kitchen.
He is wearing a black shirt and a black apron.
He is standing in front of a counter and is holding a bottle of Heinz tomato ketchup,
a bottle of honey, soy sauce, and a bowl of minced garlic.
While the second question reveals some temporal understanding limitations (the cook points to one ingredient after the other rather than pointing to or holding all of them at the same time), SmolVLM demonstrated great scene understanding and object recognition capabilities.
VLMEvalKit integration
We integrated SmolVLM with VLMEvalKit to facilitate easy evaluation across additional benchmarks.
By running the following command, you can evaluate SmolVLM or your fine-tuned SmolVLM model.
python run.py --data <benchmarks> --model SmolVLM --work-dir <output_directory>
For example, to evaluate on the MMMU dev validation set and MathVista mini and store the results in a folder called smol:
python run.py --data MMMU_DEV_VAL MathVista_MINI --model SmolVLM --work-dir smol
Use SmolVLM with Transformers
You can easily load SmolVLM using the Auto classes in transformers. Under the hood, the model and processor are mapped to the same implementations used for Idefics3.
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
Images and text can be interleaved arbitrarily, and you can pass in multiple images. Here is how you can use the chat template and pass the formatted input to the processor.
from PIL import Image
from transformers.image_utils import load_image
image1 = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/resolve/main/example_images/rococo.jpg")
image2 = load_image("https://huggingface.co/spaces/HuggingFaceTB/SmolVLM/resolve/main/example_images/rococo_1.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)
Start generating with preprocessed input and decode the generated output.
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
generated_ids,
skip_special_tokens=True,
)
print(generated_texts[0])
Training Details
Dataset
First, we had to train SmolLM2 to extend its context, which we discuss in the next subsection. Once we had a long-context SmolLM2, we trained SmolVLM using the same data that we used for Idefics3. Mainly, we used The Cauldron and Docmatix. The full list of datasets we used can be consulted here.

Context extension

SmolLM2's pre-training context window is insufficient for VLMs: images are encoded into many tokens, and we wanted to support multiple images. To address this, we extended it to 16k tokens by increasing the RoPE base value from 10k to 273k, following the guidelines in "Scaling Laws of RoPE-based Extrapolation". We then fine-tuned the model on a mix of long- and short-context datasets.
For long-context datasets, we used the "books" subset of Dolma (primarily Project Gutenberg) and code documents with 8k+ tokens from The Stack, each contributing 20% to the final mixture. For short-context datasets, we streamlined the original SmolLM2 pre-training mix to include 20% FineWeb-Edu, 20% DCLM, and 20% from our math dataset (to be released soon). The math dataset was upsampled to mitigate a performance drop observed on GSM8K during the context extension process.
All experiments were implemented using the EasyContext repository.
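For reference, the sketch below shows roughly what that configuration change looks like with the transformers config API. The RoPE base and the 16k window come from the description above; the actual context-extension training was run with EasyContext, not this snippet.

from transformers import AutoConfig, AutoModelForCausalLM

# Raise the RoPE base and the usable context window before long-context fine-tuning.
config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
config.rope_theta = 273_000              # up from the pre-training value of 10_000
config.max_position_embeddings = 16_384  # 16k-token context

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B", config=config)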
Checkpoint Selection
For our training run, we saved checkpoints every 25 optimization steps, allowing us to evaluate and potentially recover the model's state at different points in training. This practice is crucial for identifying the optimal model version, as training longer doesn't always guarantee better performance.
We evaluated performance across multiple vision-language benchmarks, each weighted according to its importance. The core benchmarks included the following:
- General multimodal understanding (MMMU and MMStar), which are the most comprehensive benchmarks
- Document and text-based visual query answering (DocVQA and TextVQA)
- Mathematical Reasoning (MathVista)
- Diagram understanding (AI2D)
To select the optimal checkpoint, we combined these benchmarks into a single metric, with manually assigned weights reflecting their relative importance in assessing the model's capabilities, and used that metric to pick the best checkpoint. Generally, the models tended to do well on most benchmarks with more training, but their relative performance on DocVQA would decrease considerably.
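As a sketch of this selection logic, the snippet below ranks checkpoints by a weighted sum of benchmark scores. The weights and scores are invented purely for illustration and are not the ones we actually used.

# Hypothetical weights, for illustration only.
weights = {"MMMU": 0.2, "MMStar": 0.2, "DocVQA": 0.2, "TextVQA": 0.15, "MathVista": 0.15, "AI2D": 0.1}

def checkpoint_score(benchmark_scores: dict) -> float:
    # Single scalar used to rank checkpoints: a weighted sum of benchmark scores.
    return sum(weights[name] * benchmark_scores[name] for name in weights)

# Dummy benchmark results per checkpoint, for illustration only.
results = {
    "step-1000": {"MMMU": 38.0, "MMStar": 41.0, "DocVQA": 80.0, "TextVQA": 72.0, "MathVista": 44.0, "AI2D": 63.0},
    "step-1025": {"MMMU": 38.5, "MMStar": 41.5, "DocVQA": 79.0, "TextVQA": 72.5, "MathVista": 44.5, "AI2D": 63.5},
}
best = max(results, key=lambda ckpt: checkpoint_score(results[ckpt]))
print(best)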
Fine-tuning
You can fine-tune SmolVLM using transformers and apply alignment techniques using TRL 🚀
We provide a notebook to fine-tune it on the VQAv2 dataset, optionally using LoRA, QLoRA or full fine-tuning. In the notebook, you can find some tricks to save even more memory and fit a larger batch size so that SmolVLM can be trained on consumer GPUs like the L4. With a batch size of 4, 8-bit loading with QLoRA and gradient checkpointing, we can fine-tune on an L4 using around ~16 GB of VRAM. This makes it possible to fine-tune your SmolVLM using Colab! You can play around with the parameters to find a nice trade-off between training duration and memory.
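The snippet below sketches the kind of setup the notebook describes (8-bit loading plus LoRA adapters and gradient checkpointing); the exact hyperparameters in the notebook may differ, and the values here are only illustrative.

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the model in 8-bit to fit on a consumer GPU such as an L4.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()  # trade extra compute for lower activation memory

# Illustrative LoRA hyperparameters; tune the rank, alpha and target modules for your task.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules="all-linear")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()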
SmolVLM also comes with TRL integration, so you can apply Direct Preference Optimization (DPO) easily through the CLI. Start by running pip install trl accelerate peft and then run the following command to fine-tune on the RLAIF-V dataset:
accelerate launch \
  --config_file examples/accelerate_configs/multi_gpu.yaml examples/scripts/dpo_vlm.py \
  --dataset_name HuggingFaceH4/rlaif-v_formatted \
  --model_name_or_path HuggingFaceTB/SmolVLM-Instruct \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 32 \
  --dataset_num_proc 32 \
  --output_dir dpo_smolvlm_rlaif-v \
  --bf16 --torch_dtype bfloat16 \
  --use_peft --lora_target_modules=all-linear
The resulting LoRA adapter weights are SmolVLM-Instruct-DPO. A detailed tutorial on preference tuning vision-based LLMs can be found here: dpo_vlm.
Citation information
You can cite us in the following way:
@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models},
  author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}
Wrapping Up
We introduced SmolVLM, a fully open, small, and mighty VLM for the community! We also provide tools for the community to use and customize it. We're looking forward to seeing what you'll create with SmolVLM.
Below are some resources if you would like to read more about all things related to SmolVLM.



