We’re excited to announce two new additions to the SmolVLM family: SmolVLM-256M and SmolVLM-500M. That’s right: 256M parameters, making it the smallest Vision Language Model in the world!
We built on everything we learned from SmolVLM 2B while focusing on efficiency, data mixtures, and new design trade-offs. We’re excited to introduce a pair of models that preserve strong multimodal performance in a fraction of the footprint.
This release comes with four checkpoints: two base models and two instruction fine-tuned models at 256M and 500M parameters. The models can be loaded directly into transformers, MLX, and ONNX, and we have demos for transformers and WebGPU (with ONNX). You can find all the models and the demo for this release here.

Overview
- SmolVLM-256M – The world’s smallest VLM!
- SmolVLM-500M – A half-billion-parameter sibling that gives a major performance bump while still remaining super lightweight.
- New Vision Encoder Choices – We compared SigLIP 400M SO (used in SmolVLM 2B and many other large VLMs) against a smaller SigLIP base patch-16/512. Surprisingly, the bigger encoder offered only marginally better results, so we opted for the 93M-parameter SigLIP base patch-16/512 in these new releases.
- Larger Image Resolution – Our smaller vision encoder processes images at a larger resolution (inspired by Apple’s VLM research and Google’s PaliGemma). This yields sharper image understanding with minimal overhead.
- Training Optimization – A new tokenization trick significantly boosted real-world benchmarks, even though it made the training loss look worse on paper.
We’re now reaching model parity with the SmolLM2 family (135M, 360M, 1.7B), so you have a complete set of smaller LLM + VLM combos to play with.
Why Go Smaller?
When we released SmolVLM 2B, the community response was fantastic: the model is very lightweight, open-source and permissive, and easy to integrate into existing workflows. But we wanted to push this approach even further for people working with constrained devices, consumer laptops, or even potentially browser-based inference. That’s where our new 256M and 500M models come in. On the other hand, for people trying to process huge amounts of data, these models can run at a fraction of the cost of the 2B model.
Over the last year, we trained two 80B VLMs and reduced them to 8B. Then, for SmolVLM, we took on the challenge of reducing that to 2B. And what we learned was that we could push the frontier way further! We’re excited to show that at 256M and 500M we can still get great performance. Our new 256M model is the smallest VLM ever released, yet it surpasses the performance of our Idefics 80B model from just 17 months ago.

Meet the 256M Parameter Giant
With just 256 million parameters, this model stands as the tiniest VLM ever released. Despite its small size, it packs a surprising punch. It’s more than capable on many multimodal tasks, including the following (a short inference sketch follows the list):
- Captioning: Describing images or short videos.
- Document Q&A: Answering questions on PDFs or scanned text.
- Basic Visual Reasoning: Answering questions on charts or diagrams.
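To get a feel for these tasks, here is a minimal captioning sketch. It assumes a recent transformers release that ships the image-text-to-text pipeline, and the image URL is just a placeholder example; swap in any image you like.
from transformers import pipeline

# Minimal captioning sketch (assumes a transformers version with the image-text-to-text pipeline).
pipe = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM-256M-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder image URL; any local path or URL works here.
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg"},
            {"type": "text", "text": "Describe this image briefly."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=60, return_full_text=False)
print(outputs[0]["generated_text"])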
A Step Up: 500M
If you need more performance headroom while keeping memory usage low, SmolVLM-500M is our half-billion-parameter compromise. It’s significantly smaller than the previous 2B release yet manages to push scores on tasks like DocVQA and MMMU closer to the larger models. We also found this model to be more robust to prompting, which makes it better suited for production out of the box. But both models do great when fine-tuned.
We have visualized the throughput gains across different batch sizes in the graph below. These numbers are throughput benchmarks run on an A100.

What Changed Since SmolVLM 2B?
1. Vision Encoder Choices
Previously, we used the standard SigLIP 400M SO vision backbone, the same one found in many VLM architectures. For these smaller models, we experimented with two setups:
- SigLIP 400M SO: Higher capability, great performance.
- SigLIP base patch-16/512 (93M): Much smaller, surprisingly close performance.
We found the performance gap wasn’t big enough to justify the heavier encoder for our 256M and 500M models. So we decided to go small on the vision encoder, too. As a bonus, the smaller encoder processes images at a larger resolution, which (per research from Apple and Google) can often yield better visual understanding without ballooning parameter counts.
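If you want to check the encoder settings for yourself, a quick way is to inspect the model config. This is a small sketch assuming the released checkpoints expose their encoder settings under a nested vision_config (as Idefics3-style VLM configs in transformers do); field names may differ across releases.
from transformers import AutoConfig

# Inspect the vision encoder settings shipped with the new checkpoint.
# Assumes an Idefics3-style config with a nested vision_config; field names may vary.
config = AutoConfig.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
vision = config.vision_config
print("image size:", vision.image_size)   # expected: 512 for the SigLIP base patch-16/512 encoder
print("patch size:", vision.patch_size)   # expected: 16
print("hidden size:", vision.hidden_size)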
2. Data mixture update
Similarly to our previous release, we rely on The Cauldron and Docmatix, with the addition of MathWriting to the mix.

The proportions of the datasets were adjusted to place a stronger emphasis on document understanding (41%) and image captioning (14%), while still maintaining a balanced focus on other crucial areas such as visual reasoning, chart comprehension, and general instruction following.
With this update, the model is built on a solid document understanding basis and leaves the door open for fine-tunes that adjust its understanding of specific tasks.
3. Tokenization optimizations
We increased the pixel shuffle even more! Our new models encode images at a rate of 4096 pixels per token, compared to 1820 pixels per token in the 2B model.
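As a rough sanity check on that figure, here is a back-of-the-envelope calculation. The 512-pixel sub-image size and patch size 16 come from the SigLIP base patch-16/512 encoder described above, while the pixel-shuffle factor of 4 is our assumption for these models.
# Rough sanity check of the "4096 pixels per token" figure (illustrative assumptions).
image_side = 512       # one sub-image at the encoder's native resolution (SigLIP patch-16/512)
patch_size = 16        # SigLIP base patch-16/512
shuffle_factor = 4     # assumed pixel-shuffle factor: merges 4x4 neighboring patches into one token

patches_per_side = image_side // patch_size           # 32
num_patches = patches_per_side ** 2                   # 1024
visual_tokens = num_patches // (shuffle_factor ** 2)  # 64
pixels_per_token = image_side ** 2 // visual_tokens   # 4096

print(visual_tokens, pixels_per_token)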
To optimize the tokenization even more, we added special tokens to represent our sub-image separators in a more efficient way. This means that instead of a separator string like <row_1_col_1> being mapped to 7 tokens, it’s now mapped to a single token. We did the same for the other separator strings. This led to a sizeable improvement in the model’s stability during training and in the quality of the results. More details are documented in this LinkedIn post.
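A quick way to see this in practice is to tokenize one of the separator strings with the released tokenizer. This is a small sketch under the assumption that the separators were registered as added special tokens in the instruct checkpoints.
from transformers import AutoTokenizer

# Check how a sub-image separator string is tokenized.
# Assumes the separators were registered as added tokens in the released tokenizer.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
tokens = tokenizer.tokenize("<row_1_col_1>")
print(len(tokens), tokens)  # expected: a single token if the separator is in the vocabulary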
4. Completing the SmolLM2-SmolVLM family
SmolLM2 came in three sizes: 135M, 360M, and 1.7B. With the two models we’re releasing today, we now have a complete set of smaller LLM + VLM combos to play with.
Smaller Multimodal Retrieval: ColSmolVLM 256M & 500M
We also found that these models are surprisingly easy to fine-tune and experiment with. The team behind the ColBERT-like retrieval models has trained ColSmolVLM, delivering SOTA multimodal retrieval speeds with performance rivaling models 10x their size. SmolVLM makes it faster and cheaper to build searchable databases. We think the 256M model can become a great specialized model for many tasks. Find the link on how to use the new ColSmolVLM with the new SmolVLM models in Next Steps.

SmolDocling
We partnered with IBM to build models for Docling. Their early results with the 256M model are impressive. Below are some early examples they shared with us. Stay tuned for more updates on this!


Using Smaller SmolVLM
The new SmolVLM models work out of the box with the old SmolVLM code, so you can use transformers and MLX for inference and fine-tuning, and TRL for alignment 🚀 Additionally, this release comes with ONNX checkpoints.
You can get started with SmolVLM using transformers as shown below.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load an example image (any PIL image, local path, or URL works here).
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg")

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-500M-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
Use SmolVLM with MLX by running the following CLI command:
python3 -m mlx_vlm.generate --model HuggingFaceTB/SmolVLM-500M-Instruct --max-tokens 400 --temp 0.0 --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg --prompt "What is in this image?"

You can play with the WebGPU demos for SmolVLM-256M-Instruct and SmolVLM-500M-Instruct.
Find links to fine-tuning and multimodal RAG with ColSmolVLM in Next Steps.
Citation information
You can cite us in the following way:
@article{marafioti2025smolvlm,
title={SmolVLM: Redefining small and efficient multimodal models},
author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
journal={arXiv preprint arXiv:2504.05299},
year={2025}
}
Next Steps
We would like to thank the ViDoRe team (Tony Wu, Manuel Faysse) for training ColSmolVLM, Joshua Lochner for the ONNX conversion and WebGPU demo, and Vaibhav Srivastav for his assistance on this release.
