Bringing Video Understanding to Every Device



SmolVLM2 represents a fundamental shift in how we think about video understanding: moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.

We’re releasing models in three sizes (2.2B, 500M and 256M), MLX-ready (Python and Swift APIs) from day zero.
We have made all models and demos available in this collection.

Want to try SmolVLM2 right away? Try our interactive chat interface, where you can test the visual and video understanding capabilities of SmolVLM2 2.2B through a simple, intuitive interface.






Technical Details

We’re introducing three new models with 256M, 500M and 2.2B parameters. The 2.2B model is the go-to choice for vision and video tasks, while the 500M and 256M models are the smallest video language models ever released.

While small in size, they outperform any existing models per memory consumption. On Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families in the 2B range, and we lead the pack in the even smaller space.

SmolVLM2 Performance

Video-MME stands out as a comprehensive benchmark due to its extensive coverage across diverse video types, varied durations (11 seconds to 1 hour), multiple data modalities (including subtitles and audio), and high-quality expert annotations spanning 900 videos totaling 254 hours. Learn more here.



SmolVLM2 2.2B: Our New Star Player for Vision and Video

Compared with the previous SmolVLM family, our new 2.2B model got better at solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions. This shows in the model's performance across different benchmarks:

SmolVLM2 Vision Score Gains

When it comes to video tasks, 2.2B offers great bang for the buck. Across the various scientific benchmarks we evaluated it on, we want to highlight its performance on Video-MME, where it outperforms all existing 2B models.

We were able to achieve a good balance between video and image performance thanks to the data mixture learnings published in Apollo: An Exploration of Video Understanding in Large Multimodal Models.

It’s so memory efficient that you can run it even in a free Google Colab.

Python Code

!pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])



Going Even Smaller: Meet the 500M and 256M Video Models

No one has dared to release such small video models until today.

Our new SmolVLM2-500M-Video-Instruct model has video capabilities very close to SmolVLM2 2.2B, but at a fraction of the size: we’re getting similar video understanding capabilities with less than a quarter of the parameters 🤯.

And then there’s our little experiment, the SmolVLM2-256M-Video-Instruct. Think of it as our “what if” project: what if we could push the boundaries of small models even further? Taking inspiration from what IBM achieved with our base SmolVLM-256M-Instruct a few weeks ago, we wanted to see how far we could go with video understanding. While it’s more of an experimental release, we’re hoping it’ll inspire some creative applications and specialized fine-tuning projects.
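Loading the smaller checkpoints works exactly like the 2.2B example above, just with a different checkpoint name. Here is a minimal sketch; the HuggingFaceTB/SmolVLM2-500M-Video-Instruct and HuggingFaceTB/SmolVLM2-256M-Video-Instruct repository names reflect how the models are listed in this post, so double-check them in the collection:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Swap in the smaller checkpoint; everything else matches the 2.2B example above.
model_path = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"  # or "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda")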



Suite of SmolVLM2 Demo applications

To demonstrate our vision for small video models, we have built three practical applications that showcase the versatility of these models.



iPhone Video Understanding

We have created an iPhone app running SmolVLM2 completely locally. Using our 500M model, users can analyze and understand video content directly on their device, no cloud required. Interested in building iPhone video processing apps with AI models running locally? We’re releasing it very soon; fill out this form to test and build with us!



VLC media player integration

Working in collaboration with VLC media player, we’re integrating SmolVLM2 to provide intelligent video segment descriptions and navigation. This integration allows users to search through video content semantically, jumping directly to relevant sections based on natural language descriptions. While this is work in progress, you can experiment with the current playlist builder prototype in this space.



Video Highlight Generator

Available as a Hugging Face Space, this application takes long-form videos (1+ hours) and automatically extracts the most significant moments. We have tested it extensively with soccer matches and other lengthy events, making it a powerful tool for content summarization. Try it yourself in our demo space.



Using SmolVLM2 with Transformers and MLX

We make SmolVLM2 available to use with transformers and MLX from day zero. In this section, you’ll find different inference alternatives and tutorials for video and multiple images.



Transformers

The easiest way to run inference with the SmolVLM2 models is through the conversational API: applying the chat template takes care of preparing all inputs automatically.

You can load the model as follows.


!pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to("cuda")



Video Inference

You can pass videos through the chat template by including {"type": "video", "path": video_path} in the message content. See below for a complete example.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])



Multiple Image Inference

In addition to video, SmolVLM2 supports multi-image conversations. You can use the same API through the chat template, providing each image as a filesystem path, a URL, or a PIL.Image object:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the differences between these two images?"},
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"},            
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])



Inference with MLX

To run SmolVLM2 with MLX on Apple Silicon devices using Python, you can use the excellent mlx-vlm library.
First, you need to install mlx-vlm from this branch using the following command:

pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm

You can then run inference on a single image using the following one-liner, which uses the unquantized 500M version of SmolVLM2:

python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --prompt "Can you describe this image?"

We also created a simple script for video understanding. You can use it as follows:

python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic action or notable event occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action." \
  --prompt "What is happening in this video?" \
  --video /Users/pedro/Downloads/IMG_2855.mov

Note that the system prompt is important to bend the model to the desired behaviour. You can use it, for example, to describe all scenes and transitions, or to get a one-sentence summary of what is going on.
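For instance, here is a sketch of the same command with a summary-oriented system prompt; the prompt wording below is just an illustration, not a tuned prompt:

python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Give a one-sentence summary of what happens in this video segment." \
  --prompt "What is happening in this video?" \
  --video /Users/pedro/Downloads/IMG_2855.mov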



Swift MLX

Swift is also supported through the mlx-swift-examples repo, which is what we used to build our iPhone app.

Until our in-progress PR is finalized and merged, you will have to compile the project from this fork, and then you can use the llm-tool CLI on your Mac as follows.

For image inference:

./mlx-run --debug llm-tool \
    --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
    --prompt "Can you describe this image?" \
    --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
    --temperature 0.7 --top-p 0.9 --max-tokens 100

Video analysis is also supported, as is providing a system prompt. We found system prompts to be particularly helpful for video understanding, to drive the model to the level of detail we’re interested in. This is a video inference example:

./mlx-run --debug llm-tool \
    --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
    --system "Focus only on describing the key dramatic action or notable event occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action." \
    --prompt "What is happening in this video?" \
    --video /Users/pedro/Downloads/IMG_2855.mov \
    --temperature 0.7 --top-p 0.9 --max-tokens 100

If you integrate SmolVLM2 into your apps using MLX and Swift, we’d love to hear about it! Please feel free to drop us a note in the comments section below!



Fine-tuning SmolVLM2

You can fine-tune SmolVLM2 on videos using transformers 🤗
We have fine-tuned the 500M variant in Colab on video-caption pairs from the VideoFeedback dataset for demonstration purposes. Since the 500M variant is small, it’s better to apply full fine-tuning instead of QLoRA or LoRA, whereas you can try applying QLoRA on the 2.2B variant. You can find the fine-tuning notebook here.
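If you want to experiment with QLoRA on the 2.2B variant, the sketch below shows one way to set it up with bitsandbytes and peft. It is a minimal illustration rather than the exact recipe from our notebook; the quantization settings and the LoRA target module names are assumptions you may need to adjust:

# Minimal QLoRA sketch for the 2.2B model (not the exact notebook recipe).
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

# Load the base model in 4-bit to keep memory usage low.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; the target module names below are assumed attention projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable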



Citation information

You can cite us in the following way:

@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models}, 
  author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}



Read More

We would love to thank Raushan Turganbay, Arthur Zucker and Pablo Montalvo Leroux for their contribution of the model to transformers.

We’re looking forward to seeing all the things you’ll build with SmolVLM2!
If you’d like to learn more about the SmolVLM family of models, feel free to read the following:

SmolVLM2 – Collection with Models and Demos


