Efficient Vision-Language-Action Model Trained on LeRobot Community Data



Today, we introduce SmolVLA, a compact (450M-parameter), open-source Vision-Language-Action (VLA) model for robotics that runs on consumer hardware.

  • Pretrained only on compatibly licensed, open-source community-shared datasets under the lerobot tag.
  • SmolVLA-450M outperforms much larger VLAs and strong baselines such as ACT in simulation (LIBERO, Meta-World) and on real-world tasks (SO100, SO101).
  • Supports asynchronous inference for 30% faster response and 2× task throughput.







Introduction

Over the past few years, Transformers have driven remarkable progress in AI, from language models capable of human-like reasoning to multimodal systems that understand both images and text. In real-world robotics, however, progress has been much slower. Robots still struggle to generalize across diverse objects, environments, and tasks. This limited progress stems from a lack of high-quality, diverse data and the absence of models that can reason and act like humans in the physical world.

In response to these challenges, the field has recently turned to vision-language-action (VLA) models, which aim to unify perception, language understanding, and action prediction within a single architecture. VLAs typically take raw visual observations and natural language instructions as input, and output corresponding robot actions. While promising, much of the recent progress in VLAs remains locked behind proprietary models trained on large-scale private datasets, often requiring costly hardware setups and extensive engineering resources.
As a result, the broader robotics research community faces significant barriers in reproducing and building upon these models.

SmolVLA addresses this gap by offering an open-source, compact, and efficient VLA model that can be trained on consumer-grade hardware using only publicly available datasets. By releasing not only the model weights but also targeting very affordable open-source hardware, SmolVLA aims to democratize access to vision-language-action models and accelerate research toward generalist robotic agents.


Figure 1: Comparison of SmolVLA across task variations. From left to right: (1) asynchronous pick-place cube counting, (2) synchronous pick-place cube counting, (3) pick-place cube counting under perturbations, and (4) generalization on pick-and-place of the lego block with real-world SO101.



Meet SmolVLA!

SmolVLA-450M is our open-source, compact yet capable VLA model. It’s:

  • Small enough to run on a CPU, train on a single consumer GPU, or even on a MacBook!
  • Trained on public, community-shared robotics data
  • Released with full training and inference recipes
  • Can be tested and deployed on very affordable hardware (SO-100, SO-101, LeKiwi, etc.)

Inspired by the training paradigms of Large Language Models (LLMs), SmolVLA goes through a pretraining phase on general manipulation data, followed by task-specific post-training. Architecturally, it combines Transformers with flow-matching decoders, and is optimized for speed and low-latency inference with the following design choices:

  • Skipping half of the layers of the vision model for faster inference and smaller size
  • Interleaving self-attention and cross-attention blocks
  • Using fewer visual tokens
  • Leveraging smaller pretrained VLMs

Despite being trained on fewer than 30k episodes, an order of magnitude fewer than other VLAs, SmolVLA matches or exceeds the performance of much larger models, both in simulation and in the real world.

To make real-time robotics more practical, we introduce an asynchronous inference stack. This design decouples how robots execute actions from how they process what they see and hear. Thanks to this separation, robots can respond more quickly in fast-changing environments.


Figure 2. SmolVLA takes as input a sequence of RGB images from multiple cameras, the robot's current sensorimotor state, and a natural language instruction. The VLM encodes these into contextual features, which condition the action expert to generate a continuous sequence of actions.



🚀 How to use SmolVLA?

SmolVLA is designed to be easy to use and integrate, whether you are finetuning it on your own data or plugging it into an existing robotics stack.



Install

First, install the required dependencies:

git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"



Finetune the pretrained model

Use smolvla_base, our pretrained 450M model, with the lerobot training framework:

python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=64 \
  --steps=20000



Train from scratch

If you'd like to train from the architecture (pretrained VLM + action expert) rather than from a pretrained checkpoint:

python lerobot/scripts/train.py \
  --policy.type=smolvla \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=64 \
  --steps=200000

You can also load SmolVLAPolicy directly:

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
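
Once loaded, the policy can be queried for a quick sanity check. The snippet below is a minimal sketch: the observation keys, image resolution, and state dimension are assumptions, so check your robot/dataset configuration for the exact feature names.

import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# Hypothetical observation dict: key names and shapes depend on your robot/dataset config.
observation = {
    "observation.images.top": torch.rand(1, 3, 512, 512),  # dummy RGB frame
    "observation.state": torch.rand(1, 6),                  # dummy sensorimotor state
    "task": ["Pick up the cube and place it in the box"],   # language instruction
}

with torch.no_grad():
    action = policy.select_action(observation)  # next action (head of the predicted chunk)
print(action.shape)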



Method

SmolVLA is not only a lightweight yet capable model, but also a method for training and evaluating generalist robotics policies. In this section, we introduce the model architecture behind SmolVLA and the asynchronous inference setup used for evaluation, which has proven to be more adaptable and capable of faster recovery.

SmolVLA consists of two core components: a Vision-Language Model (VLM) that processes multimodal inputs and an action expert that outputs robot control commands. Below, we share the details of the main components of the SmolVLA architecture and of asynchronous inference. More details can be found in our technical report.



Main Architecture



Vision-Language Model (VLM)

We use SmolVLM2 as our VLM backbone. It’s optimized for multi-image inputs and consists of a SigLIP vision encoder and a SmolLM2 language decoder.

  • Image tokens are extracted via the vision encoder.
  • Language instructions are tokenized and fed directly into the decoder.
  • Sensorimotor states are projected into a single token using a linear layer, to align with the token dimension of the language model.

The decoder layers process the concatenated image, language, and state tokens. The resulting features are then passed to the action expert.
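
The prefix construction can be pictured with the small sketch below. This is not the lerobot implementation: the hidden size, state dimension, and token counts are illustrative assumptions.

import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not the actual SmolVLA configuration).
hidden_dim = 960   # token dimension of the language decoder
state_dim = 6      # robot sensorimotor state dimension (e.g., joint positions)

state_proj = nn.Linear(state_dim, hidden_dim)    # projects the state into a single token

image_tokens = torch.rand(1, 64, hidden_dim)     # 64 visual tokens per frame
language_tokens = torch.rand(1, 24, hidden_dim)  # embedded instruction tokens
state = torch.rand(1, state_dim)

state_token = state_proj(state).unsqueeze(1)     # shape: (1, 1, hidden_dim)

# The decoder layers process the concatenated prefix; the resulting features
# are what conditions the action expert.
prefix = torch.cat([image_tokens, language_tokens, state_token], dim=1)
print(prefix.shape)  # (1, 64 + 24 + 1, hidden_dim)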



Action Expert: Flow Matching Transformer

SmolVLA's action expert is a compact transformer (~100M parameters) that generates action chunks, i.e. sequences of future robot actions, conditioned on the VLM's outputs. It is trained with a flow matching objective, which teaches the model to guide noisy samples back to the ground truth. Discrete action representations (e.g., via tokenization) are powerful but typically require autoregressive decoding, which is slow and inefficient at inference time. Flow matching instead allows direct, non-autoregressive prediction of continuous actions, enabling real-time control with high precision.

More intuitively, during training, we add random noise to the robot’s real motion sequences and ask the model to predict the “correction vector” that brings them back to the proper trajectory. This forms a smooth vector field over the motion space, helping the model learn accurate and stable control policies.
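
Below is a minimal sketch of one flow-matching training step under that intuition. The chunk length, dimensions, and the toy network standing in for the action expert are assumptions chosen for illustration, not the SmolVLA code.

import torch
import torch.nn as nn
import torch.nn.functional as F

chunk_len, action_dim, cond_dim = 50, 6, 960  # assumed sizes

# Toy stand-in for the action expert: predicts a velocity for each noisy action step,
# conditioned on per-step VLM features and the interpolation time t.
expert = nn.Sequential(
    nn.Linear(action_dim + cond_dim + 1, 256), nn.ReLU(),
    nn.Linear(256, action_dim),
)

actions = torch.rand(8, chunk_len, action_dim)  # ground-truth action chunk
cond = torch.rand(8, chunk_len, cond_dim)       # VLM features conditioning the expert
noise = torch.randn_like(actions)
t = torch.rand(8, 1, 1)                         # interpolation time in [0, 1]

noisy_actions = (1 - t) * noise + t * actions   # interpolate from noise toward data
target_velocity = actions - noise               # the "correction vector" to predict

inputs = torch.cat([noisy_actions, cond, t.expand(-1, chunk_len, 1)], dim=-1)
loss = F.mse_loss(expert(inputs), target_velocity)
loss.backward()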

We implement this with a transformer architecture using interleaved attention blocks (see Figure 2), and reduce its hidden size to 75% of the VLM's, keeping the model lightweight for deployment.



Design Choices for Efficiency and Robustness

While combining a vision-language model with an action prediction module is a common design pattern in recent VLA systems such as Pi0, GR00T, and Diffusion Policy, we identified several architectural choices that significantly enhance robustness and performance. In SmolVLA, we apply three key techniques: reducing the number of visual tokens, skipping the upper layers of the VLM, and interleaving cross- and self-attention layers in the action expert.



Visual Token Reduction

High-resolution images improve perception but can significantly slow down inference. To strike a balance, SmolVLA limits the number of visual tokens to 64 per frame during both training and inference. For instance, a 512×512 image is compressed into just 64 tokens, instead of 1024, using PixelShuffle as an efficient token-reduction technique (sketched below). While the underlying Vision-Language Model (VLM) was originally pretrained with image tiling for broader coverage, SmolVLA uses only the global image at runtime to keep inference lightweight and fast.
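
The sketch below shows the kind of pixel-shuffle rearrangement involved: spatial tokens are folded into the channel dimension, so a 32×32 patch grid (1024 tokens) becomes an 8×8 grid (64 tokens). The grid size, ratio, and channel width are illustrative assumptions.

import torch

def pixel_shuffle_tokens(x: torch.Tensor, grid: int, ratio: int) -> torch.Tensor:
    # Fold ratio x ratio neighborhoods of patch tokens into the channel dimension.
    b, n, c = x.shape                                    # (batch, grid*grid, channels)
    x = x.view(b, grid // ratio, ratio, grid // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)                      # group the spatial neighborhoods
    return x.reshape(b, (grid // ratio) ** 2, ratio * ratio * c)

tokens = torch.rand(1, 1024, 768)                        # 32x32 patch tokens from the vision encoder
reduced = pixel_shuffle_tokens(tokens, grid=32, ratio=4)
print(reduced.shape)                                     # torch.Size([1, 64, 12288])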



Faster Inference via Layer Skipping

Rather than always relying on the final layer of the VLM, which can be expensive and sometimes suboptimal, we use features from intermediate layers. Prior work has shown that earlier layers often provide better representations for downstream tasks.
In SmolVLA, the action expert only attends to VLM features up to a configurable layer N during training, set to half the total number of layers. This halves the compute cost of both the VLM and the action expert, significantly speeding up inference with minimal performance loss.
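
Conceptually, layer skipping amounts to running only the first N decoder layers and handing those intermediate features to the action expert, as in the toy sketch below (layer count, hidden size, and sequence length are assumptions, not the lerobot implementation).

import torch
import torch.nn as nn

class TruncatedDecoder(nn.Module):
    def __init__(self, layers: nn.ModuleList, keep_layers: int):
        super().__init__()
        self.layers = layers[:keep_layers]  # keep only the lower half of the VLM decoder

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states                # intermediate features fed to the action expert

# Toy stand-in for a 24-layer decoder with hidden size 960.
full_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=960, nhead=8, batch_first=True) for _ in range(24)]
)
vlm_prefix = torch.rand(1, 89, 960)         # concatenated image + language + state tokens
features = TruncatedDecoder(full_layers, keep_layers=12)(vlm_prefix)
print(features.shape)                       # torch.Size([1, 89, 960])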



Interleaved Cross and Self-Attention

Inside the action expert, attention layers alternate between:

  • Cross-attention (CA), where action tokens attend to the VLM's features
  • Self-attention (SA), where action tokens attend to one another (causally, i.e. only to the past)

We found that this interleaved design is both lighter and more effective than using full attention blocks. Models that rely only on CA or only on SA tend to sacrifice either smoothness or grounding.

In SmolVLA, CA ensures that actions are well-conditioned on perception and instructions, while SA improves temporal smoothness, which is especially critical for real-world control, where jittery predictions can result in unsafe or unstable behavior.
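
A minimal sketch of this interleaving pattern is given below. The layer count, hidden size (here 720, i.e. 75% of an assumed 960-dim VLM), and chunk length are assumptions made for illustration.

import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    def __init__(self, dim: int, heads: int, cross: bool):
        super().__init__()
        self.cross = cross
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, actions: torch.Tensor, vlm_feats: torch.Tensor) -> torch.Tensor:
        if self.cross:
            # CA: action tokens attend to the VLM features.
            attn_out, _ = self.attn(actions, vlm_feats, vlm_feats)
        else:
            # SA: action tokens attend to one another, causally (only to the past).
            t = actions.shape[1]
            causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
            attn_out, _ = self.attn(actions, actions, actions, attn_mask=causal_mask)
        actions = self.norm(actions + attn_out)
        return actions + self.mlp(actions)

dim = 720                                   # ~75% of an assumed 960-dim VLM hidden size
blocks = nn.ModuleList([InterleavedBlock(dim, 8, cross=(i % 2 == 0)) for i in range(8)])
action_tokens = torch.rand(1, 50, dim)      # one chunk of 50 future action tokens
vlm_feats = torch.rand(1, 89, dim)          # projected VLM features
for block in blocks:
    action_tokens = block(action_tokens, vlm_feats)
print(action_tokens.shape)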



Asynchronous Inference


Figure 3. Asynchronous inference. Illustration of the asynchronous inference stack. Note that the policy can be run on a remote server, possibly with GPUs.

Modern visuomotor policies output action chunks: sequences of actions to execute. There are two ways to manage them:

  • Synchronous (sync): The robot executes a chunk, then pauses while the next one is computed. Simple, but it causes a delay during which the robot cannot react to new inputs.
  • Asynchronous (async): While executing the current chunk, the robot already sends its latest observation to a Policy Server (possibly hosted on a GPU) to request the next chunk. This avoids idle time and improves reactivity.

Our async stack decouples action execution from chunk prediction, resulting in higher adaptability and the complete absence of execution lags at runtime. It relies on the following key mechanisms:

  • 1. Early trigger: When the queue length falls below a threshold (e.g., 70%), we send an observation to the Policy Server, requesting a new action chunk.
  • 2. Decoupled threads: The control loop keeps executing → inference happens in parallel (non-blocking).
  • 3. Chunk fusion: Overlapping actions from successive chunks are stitched together with a simple merge rule to avoid jitter.

We are really excited about releasing asynchronous inference because it promises greater adaptability and improved performance without changing the model. In short, async inference keeps the robot responsive by overlapping execution and remote prediction.
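
To make these mechanisms concrete, here is an illustrative sketch of such a client loop. It is not the lerobot client code: the timing constants, the placeholder request_chunk call standing in for the policy server, and the keep-queued-actions merge rule are all assumptions.

import threading
import time
from collections import deque

CHUNK_LEN, THRESHOLD, CONTROL_DT = 50, 0.7, 0.02    # assumed chunk length, trigger ratio, control period

queue = deque()
lock = threading.Lock()

def request_chunk(observation):
    """Hypothetical call to a remote policy server; replace with a real client."""
    time.sleep(0.3)                                  # simulated network + inference latency
    return [f"action_{i}" for i in range(CHUNK_LEN)]

def prediction_worker(get_observation):
    while True:
        with lock:
            low = len(queue) < THRESHOLD * CHUNK_LEN
        if low:                                      # early trigger: queue is running short
            chunk = request_chunk(get_observation())
            with lock:
                overlap = len(queue)                 # simple merge rule: keep queued actions,
                queue.extend(chunk[overlap:])        # append only the non-overlapping tail
        time.sleep(CONTROL_DT)

def control_loop(send_to_robot, steps=200):
    for _ in range(steps):
        with lock:
            action = queue.popleft() if queue else None
        if action is not None:
            send_to_robot(action)                    # execution never blocks on inference
        time.sleep(CONTROL_DT)

# Decoupled threads: inference runs in the background while the control loop executes.
threading.Thread(target=prediction_worker, args=(lambda: "latest_obs",), daemon=True).start()
control_loop(print)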


Community Datasets

While vision and language models thrive on web-scale datasets like LAION, ImageNet, and Common Crawl, robotics lacks a comparable resource. There is no "internet of robots." Instead, data is fragmented across robot types, sensors, control schemes, and formats, forming disconnected "data islands". In our previous post, we explored how this fragmentation could be addressed through open, collaborative efforts. Just as ImageNet catalyzed breakthroughs in computer vision by providing a large, diverse benchmark, we believe that community-driven robotics datasets can play the same foundational role for generalist robot policies.

SmolVLA is our first step toward that vision: it is pretrained on a curated mix of publicly available, community-contributed datasets designed to reflect real-world variation. Rather than optimizing for dataset size alone, we focus on diversity: a range of behaviors, camera viewpoints, and embodiments that promote transfer and generalization.

All training data used in SmolVLA comes from LeRobot Community Datasets: robotics datasets shared on the Hugging Face Hub under the lerobot tag. Collected in diverse settings, from labs to living rooms, these datasets represent an open, decentralized effort to scale real-world robot data.


Figure 4. A glimpse of the community dataset. Special thanks to Ville Kuosmanen for creating the visualization.

Unlike academic benchmarks, community datasets naturally capture messy, realistic interactions: varied lighting, suboptimal demonstrations, unconventional objects, and heterogeneous control schemes. This kind of diversity is very useful for learning robust, general-purpose representations.

We used a custom filtering tool created by Alexandre Chapin and Ville Kuosmanen to select datasets based on frame count, visual quality, and task coverage. After a meticulous manual review (special thanks to Marina Barannikov), we curated a set of 487 high-quality datasets focused on the SO100 robotic arm, standardized at 30 FPS. This yielded around 10 million frames, at least one order of magnitude fewer than other popular benchmark datasets, yet significantly more diverse.



Improving Task Annotations

A common issue across community datasets was noisy or missing task descriptions. Many episodes lacked annotations altogether or carried vague labels like "task desc", "Move", or "Pick". To improve quality and standardize the textual input across datasets, we used Qwen2.5-VL-3B-Instruct to generate concise, action-oriented descriptions.

Given sample frames and the original label, the model was prompted to rewrite the instruction in under 30 characters, starting with an action verb (e.g., "Pick", "Place", "Open").

The prompt used is:

Here is a current task description: {current_task}. Generate a very short, clear, and complete one-sentence description of the action performed by the robot arm (max 30 characters). Do not include unnecessary words.
Be concise.
Here are some examples: Pick up the cube and place it in the box, open the drawer and so on.
Start directly with an action verb like "Pick", "Place", "Open", etc.
Similar to the provided examples, what is the main action done by the robot arm?
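
A rough sketch of this relabeling step is shown below. It follows the standard transformers multimodal chat API for Qwen2.5-VL, but treat the frame path, preprocessing details, and generation settings as assumptions to adapt to your setup.

from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

current_task = "task desc"                           # noisy label from a community dataset
frame = Image.open("sample_frame.png")               # hypothetical sample frame from the episode

prompt = (
    f"Here is a current task description: {current_task}. Generate a very short, clear, and "
    "complete one-sentence description of the action performed by the robot arm (max 30 characters). "
    "Do not include unnecessary words. Be concise. "
    "Here are some examples: Pick up the cube and place it in the box, open the drawer and so on. "
    "Start directly with an action verb like 'Pick', 'Place', 'Open', etc. "
    "Similar to the provided examples, what is the main action done by the robot arm?"
)
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[frame], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=32)
new_task = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(new_task)                                      # e.g., "Pick the cube"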



Standardizing Camera Views

Another challenge was inconsistent camera naming. Some datasets used clear names like top or wrist.right, while others used ambiguous labels like images.laptop, whose meaning varied across datasets.
To fix this, we manually went through the datasets and mapped each camera view to a standardized scheme (a minimal remapping sketch follows the list):

  • OBS_IMAGE_1: Top-down view
  • OBS_IMAGE_2: Wrist-mounted view
  • OBS_IMAGE_3+: Additional viewpoints
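
The remapping itself boils down to a simple key rename, as sketched below. The source key names are hypothetical examples; the actual mapping was curated manually per dataset.

# Hypothetical source key names; the real mapping was decided manually per dataset.
CAMERA_REMAP = {
    "observation.images.top": "OBS_IMAGE_1",            # top-down view
    "observation.images.wrist.right": "OBS_IMAGE_2",     # wrist-mounted view
    "observation.images.laptop": "OBS_IMAGE_3",          # ambiguous label, resolved by manual review
}

def standardize_keys(frame: dict) -> dict:
    """Rename camera keys in a single frame dict to the standardized scheme."""
    return {CAMERA_REMAP.get(key, key): value for key, value in frame.items()}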

We further isolate the contributions of community dataset pretraining and multitask finetuning. Without pretraining on the LeRobot community datasets, SmolVLA initially achieves 51.7% success on SO100. After pretraining on community-collected data, performance jumps to 78.3%, a +26.6% absolute improvement. Multitask finetuning further boosts performance, showing strong task transfer capabilities even in low-data regimes.

Table 1. Impact of Pretraining on Community Datasets and Multitask Finetuning.



Results

We evaluate SmolVLA across simulation and real-world benchmarks to test its generalization, efficiency, and robustness. Despite being compact, it consistently outperforms or matches the performance of significantly larger models and of policies pretrained on much larger robotics datasets.


Table 2. SmolVLA Performance on Simulation Benchmarks.


Table 3. SmolVLA vs Baselines on Real-World Tasks (SO100).

In real-world settings, SmolVLA is evaluated on two task suites: SO100 and SO101. These tasks include pick-place, stacking, and sorting, with both in-distribution and out-of-distribution object configurations.
On SO101, SmolVLA also excels in generalization:


Table 4. Generalization of SmolVLA to a New Embodiment (SO101) vs ACT.

Finally, we evaluate SmolVLA under synchronous and asynchronous inference modes. Async inference decouples action execution from model inference, allowing the policy to react while the robot is moving.

  • Both modes achieve similar task success (≈78%), but async inference:
    • Completes tasks ~30% faster (9.7s vs. 13.75s)
    • Enables 2× more completions in fixed-time settings (19 vs. 9 cubes)

This results in more responsive and robust real-world performance, especially in dynamic environments with shifting objects or external disturbances.


Figure 5. Asynchronous vs. Synchronous Inference in Real-World Tasks.
(a) Task success rates (%), (b) average completion time (s), and (c) number of tasks completed within a fixed time window.



Conclusion

SmolVLA is our contribution to building robotics foundation models that are open, efficient, and reproducible. Despite its small size, it matches or outperforms larger, proprietary models across a range of real-world and simulated tasks. By relying solely on community-contributed datasets and affordable hardware, SmolVLA lowers the barrier to entry for researchers, educators, and hobbyists alike.
But this is just the beginning. SmolVLA is more than just a model; it is part of a growing open-source movement toward scalable, collaborative robotics.



Call to Action:

  • 🔧 Try it out! Finetune SmolVLA on your own data, deploy it on affordable hardware, or benchmark it against your current stack, and share your results on Twitter/LinkedIn.
  • 🤖 Upload a dataset! Got a robot? Collect and share your data using the lerobot format. Help expand the community dataset that powers SmolVLA.
  • 💬 Join the blog discussion. Drop your questions, ideas, or feedback in the discussion below. We are happy to help with integration, training, or deployment.
  • 📊 Contribute. Improve datasets, report issues, suggest new ideas. Every contribution helps.
  • 🌍 Spread the word. Share SmolVLA with fellow researchers, developers, or educators interested in efficient, real-time robotic policies.
  • 📫 Stay in touch: Follow the LeRobot organization and Discord server for updates, tutorials, and new releases.

Together, we can make real-world robotics more capable, more affordable, and more open. ✨


