Vision-Language-Action Models for General Robot Control



We’ve ported the first robotics foundation models to Hugging Face LeRobot! Both π0 and π0-FAST, developed by Physical Intelligence, are now available in the LeRobot repository, bringing generalist robotic intelligence to the Hugging Face ecosystem. If you are curious about how Vision-Language-Action (VLA) models differ from Vision-Language Models (VLMs) and how actions are represented, dive into this blog post to find out!

Explore the model collection and the PyTorch version of the model in our repository:
Hugging Face collection of Pi0 models | Hugging Face collection of Pi0+FAST models | LeRobot repo




Introduction

Robert Heinlein suggests that a well-rounded person should be able to handle a wide range of tasks—both mental and physical—rather than being narrowly specialized in a single field. Drawing a parallel between a well-rounded person and machine intelligence: AI systems vary widely, but human intelligence excels in versatility—adapting to tasks, environments, and surprises. While large language and vision-language models (LLMs, VLMs) show promise, they lack interaction with the physical world. To bridge this gap, we need models trained on robotic data. Generalist robot models can enhance adaptability, using diverse data to improve generalization and robustness. Instead of training on isolated tasks, pre-training on varied robotic data—much like LLMs—boosts efficiency and performance.

Developing generalist robot policies, or robot foundation models, presents three key challenges:

  1. The need for large-scale research to fully leverage the benefits of pre-training.

  2. Designing model architectures that can integrate diverse data sources while capturing complex physical interactions. A key challenge in this regard is cross-embodiment training, where a model must learn from diverse robot types with varying configurations, control spaces, and action representations. Existing approaches tackle this by:

    • Combining multimodal datasets from different robotic platforms to boost generalization.
    • Using shared representations to bridge the gap between distinct robot morphologies, such as single-arm, dual-arm, and mobile manipulators.
  3. Crafting an efficient training recipe, as recent advances in NLP and vision have heavily relied on careful pre-training and post-training strategies.

In this post, we introduce π0 and π0-FAST, prototype models and learning frameworks developed by Physical Intelligence, designed to overcome these challenges.




🔍 What’s π0?

Paper | Jax Code

π0 (Pi-Zero) is a Vision-Language-Action (VLA) model developed by the Physical Intelligence team for generalist robot control. It builds upon large-scale pretraining and flow-matching-based action generation, enabling robots to perform dexterous manipulation tasks across different embodiments.

π0 is trained on data from 7 robotic platforms and 68 unique tasks, demonstrating strong zero-shot and fine-tuned performance on complex, real-world tasks such as laundry folding, table bussing, grocery bagging, box assembly, and object retrieval.

Unlike standard robotic policies, π0 employs flow matching to produce smooth, real-time action trajectories at 50 Hz, making it highly efficient, precise, and adaptable for real-world deployment. Flow matching has been used in continuous normalizing flows and has improved generation quality in diffusion models. The denoising process π0 uses works in the same way: it starts from random noise that progressively converges towards a sequence of motor actions that makes sense.
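
To make the idea concrete, here is a minimal, illustrative sketch of flow-matching inference for an action chunk: starting from Gaussian noise, a learned velocity field is integrated over a few Euler steps to produce the action sequence. The model, observation format, shapes, and step count below are placeholders, not the actual π0 implementation.

import torch

def sample_actions(velocity_model, observation, horizon=50, action_dim=7, num_steps=10):
    """Illustrative flow-matching sampler: integrate a learned velocity field
    from pure noise (t=0) to an action chunk (t=1) with simple Euler steps."""
    actions = torch.randn(1, horizon, action_dim)      # start from random noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((1,), step * dt)
        # The model predicts the velocity that moves the noisy actions toward
        # the data distribution, conditioned on the images/text/state in `observation`.
        velocity = velocity_model(actions, t, observation)
        actions = actions + dt * velocity               # Euler integration step
    return actions                                       # (1, horizon, action_dim)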



How to Use π0 in LeRobot?

First of all, you need to upgrade your lerobot install, which now leverages transformers as a dependency! After a git clone, simply run:

pip install -e ".[pi0]"

π0 models are foundation models that, much like PaliGemma, are meant to be adapted to a variety of frameworks, environments, and scene inputs. The base models here are usable as-is, particularly π0.



Inference on π0 pretrained model

python lerobot/scripts/eval.py \
--pretrained_policy.path=/path/to/pretrained/pi0

However, performance is reduced since this is a conversion from JAX to PyTorch and from a specific environment. We recommend fine-tuning your own π0 on your own environment, as shown below.



Fine-tuning the π0 Pretrained Model

To fine-tune the π0 model using the pi0_base checkpoint from openpi, run the following command:

python lerobot/scripts/train.py \
--policy.path=lerobot/pi0 \
--dataset.repo_id=danaaubakirova/koch_test

To fine-tune the π0 neural network with PaliGemma and Expert Gemma, which were pretrained using VLM default parameters before π0 fine-tuning, execute:

python lerobot/scripts/train.py \
--policy.type=pi0 \
--dataset.repo_id=danaaubakirova/koch_test

You can also use the pretrained π0 model independently from the LeRobot training framework with the following code:

policy = Pi0Policy.from_pretrained("lerobot/pi0")
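
From there, a minimal inference sketch could look like the following. The observation keys and tensors below are illustrative placeholders and depend on your robot/dataset configuration; select_action is the standard LeRobot policy entry point.

import torch

# Illustrative batch; use the observation keys your dataset actually defines.
batch = {
    "observation.images.top": image_tensor,   # (B, C, H, W) camera frame
    "observation.state": state_tensor,        # (B, state_dim) proprioception
    "task": ["transfer the cube"],            # language instruction
}
with torch.no_grad():
    action = policy.select_action(batch)       # next action to send to the robot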



What’s the difference between VLMs and VLAs?

Vision-Language Models (VLMs) and Vision-Language-Action Models (VLAs) share a common foundation: transformers. However, the key distinction lies in action representation. While VLMs process and generate multimodal representations (images and text), VLAs extend this by incorporating action and observation state tokens. With these additional tokens in place, the next challenge is understanding how attention is computed.



Attention Mechanisms in Robotics Policies

Let’s expand our vocabulary and introduce key terms:



State Token

  • It’s a single token that represents the robot’s current environment state (e.g., joint angles, sensor values, or other relevant observations).
  • The masking rules allow this token to attend to the prefix’s image and text, meaning the state token can “see” any visual or textual cues necessary for decision-making.
  • It also attends to previous states in a triangular manner. If multiple state tokens are used, each new state token can see older ones but not vice versa.



Action Tokens

  • Represent the motor command sequence.
  • Have full visibility over everything except padding regions. This means each action token can attend to:
    • All non-padding image tokens (the entire scene),
    • All non-padding text tokens (instructions or descriptions),
    • State tokens (both current and previous),
    • Other action tokens.



Prefix Tokens

  • Represent the full scene and fully attend to one another, much like PaliGemma.



Key Idea

The suffix tokens encapsulate:

  • The robot’s internal representation of the environment (state),
  • The commands or controls the robot issues (action),
  • An encoding of time or step index (time embedding).

They are appended after the prefix portion (images + text), so the prefix serves as context (e.g., a scene image, language instructions like “be a good robot” or “transfer the cube”), while the suffix captures policy-specific features.




⚡ Towards Faster Attention in π0

However, efficiently handling attention in π0 comes with its own set of challenges. The unique shape of its attention mask influences how attention is computed—let’s dive into the details!



Handling 2D Attention Masks

The resulting 2D causal mask exhibits strong block sparsity, but defining the boundaries of each block—especially in a batch of samples—is a bit tricky. We are used to causal masks with triangular structures for autoregressive modeling, but this isn’t one of those cases.

As you can see in the example below: the image (first element) has some padding tokens, representing empty cameras. Then, text tokens and state tokens are added. This “prefix” part forms fully noncausal attention, as in PaliGemma. The “suffix” (state + action/time tokens) then has a causal block structure. A naive eager implementation performs matrix multiplications and applies softmax across the entire input, making it highly inefficient.

VLA Attention Mask

Figure 1: The visualization of the VLA attention mask
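
As a rough sketch of how such a prefix/suffix mask can be assembled, here is a simplified, single-sample version with made-up segment lengths and no padding. It only encodes the masking rules described above and is not the LeRobot implementation.

import torch

def build_pi0_style_mask(prefix_len: int, state_len: int, action_len: int) -> torch.Tensor:
    """Simplified π0-style 2D attention mask (True = attention allowed).

    - prefix (images + text): fully bidirectional, as in PaliGemma
    - state tokens: see the prefix and earlier states (triangular among states)
    - action tokens: see the prefix, all states, and all other action tokens
    """
    total = prefix_len + state_len + action_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Prefix attends to prefix (full, noncausal block).
    mask[:prefix_len, :prefix_len] = True

    # State tokens attend to the prefix and to earlier/equal states.
    s0 = prefix_len
    mask[s0:s0 + state_len, :prefix_len] = True
    mask[s0:s0 + state_len, s0:s0 + state_len] = torch.tril(
        torch.ones(state_len, state_len)
    ).bool()

    # Action tokens attend to the prefix, the states, and all other action tokens.
    a0 = prefix_len + state_len
    mask[a0:, :] = True
    return mask

print(build_pi0_style_mask(prefix_len=6, state_len=2, action_len=4).int())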



Can we use FlashAttention2?

  • FlashAttention2 provides a varlen interface, but the cu_seqlens (cumulative sequence lengths) must be computed manually. It is designed for contiguous (or strictly causal) attention patterns with uniform query and key lengths.
  • It doesn’t naturally handle irregular block masks or arbitrary per-token “allowed” positions, which is exactly what we need.
  • So, while it’s possible to use it at some implementation cost, we decided to turn to…



Using FlexAttention in PyTorch

This looks like a job for FlexAttention! It has a pure PyTorch interface, in which we explored two options:

  1. Adding a score_mod to our causal mask in positions where attention is masked out. However, even a scalar addition significantly decreases FlexAttention’s performance. This is because the score_mod in our case is applied outside of the optimized CUDA kernel.
  2. The better option is to index our causal mask and pass the resulting mask_mod signature to create a block mask. This block mask efficiently indicates where attention needs to be computed and where it can be skipped entirely.

import torch
from torch.nn.attention.flex_attention import (
    _mask_mod_signature,
    create_block_mask,
    flex_attention,
)

# Build the 2D boolean attention mask for the batch, following the
# prefix/suffix structure described above (generate_causal_mask is a placeholder).
causal_mask = generate_causal_mask(your_samples)  # shape: (B, H, q_len, kv_len)

def precomputed_mask_factory(precomputed_mask: torch.Tensor) -> _mask_mod_signature:
    def mask_mod(b, h, q_idx, kv_idx):
        # True where attention is allowed, False where it can be skipped.
        return precomputed_mask[b][h][q_idx][kv_idx]
    return mask_mod

mask_mod = precomputed_mask_factory(causal_mask)

block_mask = create_block_mask(
    mask_mod=mask_mod,
    B=batch_size,   # dimensions of the query/key tensors
    H=num_heads,
    Q_LEN=q_len,
    KV_LEN=kv_len,
)

attn_output = flex_attention(
    query,
    key,
    value,
    block_mask=block_mask,
)

The current implementation runs, and a work in progress is to have it support torch.compile and leverage it to the fullest!
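
For reference, FlexAttention is designed to be compiled; a minimal sketch of that direction (not the current LeRobot implementation) simply wraps the attention call in torch.compile:

import torch
from torch.nn.attention.flex_attention import flex_attention

# Compiling fuses the block-mask logic into an optimized kernel.
compiled_flex_attention = torch.compile(flex_attention)
attn_output = compiled_flex_attention(query, key, value, block_mask=block_mask)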



How to Effectively Represent Actions?

Now that we know actions are simply n-dimensional vectors that can be tokenized, we can explore the challenges of action representation in Vision-Language-Action (VLA) models. The way actions are represented directly impacts efficiency, generalization, and execution fidelity.

One approach is semantic action representation, where actions are described as high-level concepts like sub-tasks or keypoints. While this allows for few-shot and zero-shot learning, it often relies on hand-designed low-level controllers, limiting flexibility across different robots. In contrast, low-level control representations map actions directly to motor commands, enabling precise movements but making training less stable and harder to scale.

Most existing VLAs use discrete action tokenization, converting continuous actions into discrete tokens generated autoregressively. The most common method—per-dimension, per-timestep binning—struggles with high-frequency control tasks, leading to lossy representations and inefficient training. Alternatives like vector quantization (VQ) and time-series compression help, but VQ is sensitive to hyperparameters, making it less reliable for diverse robot designs.
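
For intuition, here is a minimal sketch of that naive per-dimension, per-timestep binning baseline (the bin count and shapes are illustrative). Note how the token count grows linearly with control frequency, which is exactly what hurts high-frequency control.

import numpy as np

def bin_tokenize(actions: np.ndarray, num_bins: int = 256) -> np.ndarray:
    """Naive per-dimension, per-timestep binning of actions in [-1, 1].

    actions: (timesteps, action_dim), assumed pre-normalized.
    Produces one token per (timestep, dimension) pair.
    """
    bin_edges = np.linspace(-1.0, 1.0, num_bins + 1)
    tokens = np.clip(np.digitize(actions, bin_edges) - 1, 0, num_bins - 1)
    return tokens.flatten()

# A 1-second chunk at 50 Hz for a 7-DoF arm already costs 350 tokens.
chunk = np.random.uniform(-1, 1, size=(50, 7))
print(bin_tokenize(chunk).shape)  # (350,)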

To address these issues, Frequency-space Action Sequence Tokenization (FAST) introduces a novel time-series compression approach based on the Discrete Cosine Transform (DCT). FAST reduces redundancy, improves efficiency, and enhances action fidelity.

With this, we present π0-FAST, a faster, autoregressive version of π0 that is also available in the LeRobot repo. It is an extension of π0 that leverages this new tokenizer for better action representation.




🚀 What’s π0-FAST?

Paper | Jax Code | Our implementation in LeRobot

π0-FAST is an autoregressive version of π0, introducing FAST (Frequency-space Action Sequence Tokenization)—a new tokenization scheme that enhances efficiency and performance.



Key Benefits of π0-FAST:

  • 5x faster training compared to diffusion-based VLAs.
  • Improved action representation, reducing redundancy in action sequences.
  • Stronger generalization across unseen environments and robot morphologies.

🔗 The π0-FAST tokenizer can be accessed here: FAST Tokenizer

🔗 Pretrained weights can be accessed here: Pi0+FAST




How does FAST work?

FAST uses the Discrete Cosine Transform (DCT) to compress continuous action sequences into discrete tokens. The process, illustrated in Figure 2, begins with normalizing raw robot actions, mapping the 1st and 99th quantiles of each action dimension to the range [-1, 1]. This normalization ensures consistency across different robotic systems and improves robustness against outliers.

Each action dimension is then transformed independently using the DCT, converting the time-domain signal into the frequency domain. To reduce redundancy, insignificant coefficients are removed through a scale-and-round operation, where a hyperparameter balances compression rate and reconstruction accuracy. The resulting DCT coefficient matrix, often sparse, is flattened into a one-dimensional sequence of integers, interleaving low-frequency components first across dimensions to preserve critical information.

To further compress the sequence, Byte Pair Encoding (BPE) is applied. As usual, BPE merges frequently occurring patterns across dimensions while maintaining a fixed-size vocabulary.


Figure 2: The FAST motion tokenization pipeline

Since all operations are invertible, actions can be reconstructed efficiently and losslessly from tokens. The tokenization pipeline has only two hyperparameters: the scaling coefficient applied before rounding and the BPE vocabulary size. Both parameters remain robust across different datasets.
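
To make the pipeline concrete, here is a simplified sketch of the DCT encode/decode steps (per-dimension DCT, scale-and-round, and low-frequency-first flattening). It omits the BPE stage and uses an illustrative scale, so it is not the actual FAST implementation.

import numpy as np
from scipy.fft import dct, idct

def fast_encode(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """actions: (timesteps, action_dim), quantile-normalized to [-1, 1]."""
    coeffs = dct(actions, axis=0, norm="ortho")             # DCT along time, per action dimension
    quantized = np.round(coeffs * scale).astype(np.int64)   # scale-and-round: small coefficients drop to 0
    # Flatten row-major: the lowest-frequency components of all dimensions come first.
    # The real pipeline then applies BPE on top of this integer sequence.
    return quantized.flatten()

def fast_decode(tokens: np.ndarray, action_dim: int, scale: float = 10.0) -> np.ndarray:
    coeffs = tokens.reshape(-1, action_dim).astype(np.float64) / scale
    return idct(coeffs, axis=0, norm="ortho")                # invert the DCT to recover actions

chunk = np.random.uniform(-1, 1, size=(50, 7))   # 1 s of 7-DoF actions at 50 Hz
recovered = fast_decode(fast_encode(chunk), action_dim=7)
print(np.abs(recovered - chunk).max())           # reconstruction error, controlled by `scale`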

Moreover, a universal version of FAST, called FAST+, has been trained on one million action sequences from single-arm, bimanual, and mobile manipulation robots, making it applicable across diverse robotic setups. FAST+ is available as a Hugging Face AutoProcessor, allowing users to tokenize action sequences with just a few lines of code.

For optimal compression, input actions should be quantile-normalized to [-1, 1] before tokenization. With the AutoProcessor module, users can also train a custom FAST tokenizer on their own datasets.




How to Use the FAST Tokenizer?

🔗 Code for usage and for training custom action tokenizers is in the official FAST repo

FAST is integrated into Hugging Face Transformers and can be easily used for encoding and decoding robot action sequences.
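
A minimal usage sketch, assuming the FAST+ checkpoint on the Hugging Face Hub; the repo id, array shapes, and the exact decode call should be checked against the official FAST repo.

import numpy as np
from transformers import AutoProcessor

# Load the universal FAST+ action tokenizer (custom processing code lives in the model repo).
tokenizer = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", trust_remote_code=True
)

# A batch of quantile-normalized action chunks: (batch, timesteps, action_dim).
action_chunks = np.random.uniform(-1, 1, size=(8, 50, 7))

tokens = tokenizer(action_chunks)            # encode to lists of discrete tokens
decoded_actions = tokenizer.decode(tokens)   # reconstruct continuous action chunks

Per the FAST repo, the same AutoProcessor interface can also be used to fit a new tokenizer on your own action data.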



What’s Next for Generalist Robot Intelligence?

With π0 and π0-FAST, we take a significant step towards generalist robot intelligence, bringing scalable, efficient, and versatile Vision-Language-Action (VLA) models to LeRobot. By leveraging FAST tokenization, we enhance action representation, enabling robots to perform a diverse range of tasks with greater efficiency and adaptability. These steps open the door for future multi-embodiment, real-time robotic policies, pushing the boundaries of what robots can achieve in the real world. 🚀



Additional Resources



References

@book{heinlein2021time,
  title={Time Enough for Love},
  author={Heinlein, Robert A.},
  year={2021},
  publisher={Penguin}
}

@article{black2024pi_0,
  title={$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control},
  author={Black, Kevin and Brown, Noah and Driess, Danny and Esmail, Adnan and Equi, Michael and Finn, Chelsea and Fusai, Niccolo and Groom, Lachy and Hausman, Karol and Ichter, Brian and others},
  journal={arXiv preprint arXiv:2410.24164},
  year={2024}
}

@article{pertsch2025fast,
  title={FAST: Efficient Action Tokenization for Vision-Language-Action Models},
  author={Pertsch, Karl and Stachowicz, Kyle and Ichter, Brian and Driess, Danny and Nair, Suraj and Vuong, Quan and Mees, Oier and Finn, Chelsea and Levine, Sergey},
  journal={arXiv preprint arXiv:2501.09747},
  year={2025}
}


