Previously, we discussed AlpamayoR1 (AR1), an autonomous driving model that integrates a VLM as a reasoning backbone. It relies on a carefully collected chain-of-causation dataset; training on this dataset enables AR1 to “reason” in natural language to untangle difficult driving situations.
But what if natural language is not the ideal medium for reasoning in driving scenarios? After all, when faced with a driving situation that requires an immediate response, human drivers generally act reflexively rather than “reasoning in language step by step”. What is the alternative for driving models?
In this article, we break down the LatentVLA architecture, a compelling counterpoint to language-based approaches: it requires no natural-language dataset, performs reasoning in latent space, and uses knowledge distillation to meet real-time constraints.
Latent Action Learning
A large part of AR1’s success resides in the chain-of-causation dataset, whose collection required industrial-scale effort, a carefully designed labeling pipeline, and extensive validation.
In contrast, LatentVLA takes the opposite direction: the authors argue that raw driving data already contains the structure required to train a large model, and that natural language is inherently biased and difficult to align with actions. Moreover, generating natural-language reasoning chains is inefficient, since some tokens don’t contribute meaningfully to the reasoning process (e.g. stop words).
They therefore introduce a self-supervised framework that learns to predict actions in a small latent space. In other words, the model uses unlabelled driving data to infer the latent actions that generated this data. These latent actions will serve as the building blocks for latent-space reasoning.
Representation Learning
To predict latent actions from unlabeled data, the authors use a method reminiscent of LAPO (Learning to Act without Actions) [2]. This approach relies on an encoder-decoder setup where the encoder (also called the “inverse dynamics model”, IDM) uses two consecutive frames to predict a continuous action vector, and the decoder (the “forward dynamics model”, FDM) uses the current frame and the predicted action vector to reconstruct the next frame.
This clever setup forces the learned action representation to explain the state transitions observed in the dataset. However, this continuous action representation is still incompatible with the VLMs we intend to use. To discretise it, the authors use a VQ-VAE (Vector-Quantised Variational Auto-Encoder), which maps continuous vectors to the closest discrete vectors in a learned codebook (i.e. a dictionary of discrete actions) in a differentiable way. This quantised action is the one used by the FDM to decode the next frame.
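As a concrete illustration, here is a minimal NumPy sketch of the nearest-neighbour lookup a VQ-VAE codebook performs. The codebook size of 16 matches LatentVLA; everything else (dimensions, data) is a toy stand-in, not the authors’ implementation:

```python
import numpy as np

def vq_quantise(z, codebook):
    """Map each continuous latent action to its nearest codebook entry.

    z: (batch, dim) continuous latent actions from the IDM.
    codebook: (K, dim) learned dictionary of discrete actions.
    Returns the quantised vectors and their codebook indices.
    """
    # Squared Euclidean distance between each latent and each code.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (batch, K)
    idx = dists.argmin(axis=1)                                      # (batch,)
    z_q = codebook[idx]                                             # (batch, dim)
    # During training, the straight-through estimator copies gradients
    # from z_q back to z, which is what makes the lookup usable end-to-end.
    return z_q, idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))               # 16 discrete actions
z = codebook[3] + 0.01 * rng.normal(size=(1, 8))  # a latent near code 3
z_q, idx = vq_quantise(z, codebook)               # should recover index 3
```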
By optimising the next-frame reconstruction error, the IDM and FDM are jointly trained to encode a predictive discrete action representation.
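Put together, the pipeline can be sketched as follows, with single linear layers standing in for the real IDM/FDM networks and quantisation omitted for brevity (all shapes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
dim_frame, dim_act = 32, 8  # toy frame / latent-action dimensions

# Toy stand-ins for the learned networks (single linear maps here).
W_idm = rng.normal(size=(2 * dim_frame, dim_act)) * 0.1        # inverse dynamics
W_fdm = rng.normal(size=(dim_frame + dim_act, dim_frame)) * 0.1  # forward dynamics

def idm(f_t, f_t1):
    # Encoder: two consecutive frames -> continuous latent action.
    return np.concatenate([f_t, f_t1]) @ W_idm

def fdm(f_t, action):
    # Decoder: current frame + (quantised) action -> next-frame prediction.
    return np.concatenate([f_t, action]) @ W_fdm

f_t, f_t1 = rng.normal(size=dim_frame), rng.normal(size=dim_frame)
a = idm(f_t, f_t1)                             # latent action
f_t1_hat = fdm(f_t, a)                         # reconstructed next frame
recon_loss = np.mean((f_t1_hat - f_t1) ** 2)   # objective minimised jointly
```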
Distinguishing Ego-Actions from Environmental Noise
Now you might think: “The driver’s actions aren’t the only factor influencing the next frame when driving; what if a bird flies in front of the camera? Does this pollute the action representation?”. To this, the authors respond: yes, there must be a mechanism that disentangles the impact of the driver’s actions on the future from that of the environment.
The elegant solution to this problem is a two-stage encoder-decoder setup:
- Conditioned on the ground-truth trajectory, ego-state and previous frame, the encoder predicts a latent action. Since this latent is already conditioned on the vehicle dynamics through the trajectory and ego-state, it only needs to model the environmental dynamics to enable the decoder to reconstruct the next frame. This latent is then quantised, and the codebook used to this end is frozen for the next stage.
- Conditioned on the previous frame and the quantised environmental latent, the encoder encodes another latent action. Similarly, since the environmental dynamics are now known and part of the conditioning, this second latent is forced to encode the ego-action. Using a new codebook, this action is quantised into a discrete ego-action token.
Finally, both actions are fed to the decoder to reconstruct the next frame. This setup ensures a clean separation between ego-actions and environmental dynamics.
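Under the same toy assumptions (linear encoders, random codebooks; none of this is the authors’ code), the two-stage scheme looks roughly like this:

```python
import numpy as np

rng = np.random.default_rng(5)
d_f, d_a = 32, 8  # toy frame / latent-action dimensions

def quantise(z, codebook):
    # Nearest-neighbour lookup (straight-through estimator in training).
    return codebook[((z[None, :] - codebook) ** 2).sum(-1).argmin()]

# Stage 1: the encoder sees trajectory + ego-state + previous frame, so the
# vehicle dynamics are given and its latent only captures the environment.
env_codebook = rng.normal(size=(16, d_a))   # frozen after stage 1
traj = rng.normal(size=d_f)
ego = rng.normal(size=d_f)
frame = rng.normal(size=d_f)
W_env = rng.normal(size=(3 * d_f, d_a)) * 0.1
z_env = quantise(np.concatenate([traj, ego, frame]) @ W_env, env_codebook)

# Stage 2: conditioned on the frame and the environment latent, the second
# encoder is forced to capture the ego-action, quantised with a NEW codebook.
ego_codebook = rng.normal(size=(16, d_a))
W_ego = rng.normal(size=(d_f + d_a, d_a)) * 0.1
z_ego = quantise(np.concatenate([frame, z_env]) @ W_ego, ego_codebook)

# The decoder then reconstructs the next frame from the frame and BOTH latents.
```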
VLM Training
Building on the learned action representation, the authors train a Qwen2.5-VL model to predict the same latent actions as the encoder-decoder model. This is achieved by having the encoder produce a trajectory of 12 latent actions for a given input frame and having the VLM minimise their negative log-likelihood:
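This objective is a plain cross-entropy over the latent action tokens, sketched here in NumPy (the counts of 12 actions and 16 tokens come from the article; the rest is illustrative):

```python
import numpy as np

def latent_action_nll(logits, target_ids):
    """Negative log-likelihood of the encoder's latent actions under the VLM.

    logits: (T, K) VLM scores over K=16 action tokens for T=12 steps.
    target_ids: (T,) token indices produced by the frozen encoder.
    """
    # Log-softmax over the action vocabulary at each step.
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    # Average negative log-probability of the target token at each step.
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(2)
logits = rng.normal(size=(12, 16))    # 12 latent actions, 16-token codebook
targets = rng.integers(0, 16, size=12)
loss = latent_action_nll(logits, targets)
```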
A striking difference from other approaches employing action codebooks is the number of action tokens used by LatentVLA. Where models like AutoVLA use an action codebook of 2048 special tokens, LatentVLA only uses 16.
This results in:
- A simpler learning task: in a 2048-entry codebook, actions probably represent very precise driving decisions like “steer left at a 16-degree angle”. With only 16 tokens, the model likely adopts higher-level directives like “accelerate slightly” or “take a narrow right turn”, which require fewer demonstrations to learn.
- Preservation of the VLM’s pre-training knowledge: the model doesn’t have to learn over 2000 “new words”.
Knowledge Distillation
Where AlpamayoR1 relied on efficient tokenisation and flow-matching diffusion to maintain real-time performance, LatentVLA opts for a completely different approach: knowledge distillation. To this end, the authors introduce a fusion module inside existing E2E architectures (iPad [4] and Transfuser [5]). This fusion module is fed visual and action embeddings by the VLM and outputs features in Bird’s-Eye-View (BEV) space. These embeddings serve as keys and values in cross-attention with BEV queries produced by the E2E model, allowing the E2E model to integrate insights from the VLM.
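Schematically, the fusion boils down to standard cross-attention, sketched here in NumPy (the dimensions and feature counts are made up for illustration, not taken from the paper):

```python
import numpy as np

def cross_attention(bev_queries, vlm_feats, d_k):
    """BEV queries from the E2E planner attend over VLM fusion features.

    bev_queries: (Nq, d) queries produced by the E2E model.
    vlm_feats:   (Nk, d) VLM embeddings acting as both keys and values.
    """
    scores = bev_queries @ vlm_feats.T / np.sqrt(d_k)       # (Nq, Nk)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # row-wise softmax
    return weights @ vlm_feats                              # (Nq, d)

rng = np.random.default_rng(3)
d = 64
bev_q = rng.normal(size=(100, d))   # e.g. 100 BEV grid queries
vlm_kv = rng.normal(size=(20, d))   # VLM-derived keys/values
fused = cross_attention(bev_q, vlm_kv, d)
```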

However, the VLM remains too large to be used efficiently at test time. Therefore, a small 50M-parameter decision transformer is trained to mimic the large 3.8B-parameter Qwen2.5-VL. This is achieved by minimising the KL divergence between the teacher and student distributions:
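A minimal sketch of this distillation objective, assuming both models output logits over the 16-token action codebook (temperature scheduling and loss weighting are omitted):

```python
import numpy as np

def softmax(x, T=1.0):
    z = np.exp((x - x.max(-1, keepdims=True)) / T)
    return z / z.sum(-1, keepdims=True)

def kl_distill_loss(teacher_logits, student_logits, T=1.0):
    """KL(teacher || student), averaged over the latent action steps."""
    p = softmax(teacher_logits, T)   # frozen 3.8B Qwen2.5-VL teacher
    q = softmax(student_logits, T)   # 50M decision-transformer student
    return (p * (np.log(p) - np.log(q))).sum(-1).mean()

rng = np.random.default_rng(4)
teacher = rng.normal(size=(12, 16))                      # 12 steps, 16 tokens
loss_self = kl_distill_loss(teacher, teacher)            # identical -> 0
loss_other = kl_distill_loss(teacher, rng.normal(size=(12, 16)))  # > 0
```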
This framework enables LatentVLA to operate with a very compact reasoning backbone and provides a general recipe for integrating VLM knowledge into traditional E2E architectures at lower cost.

Evaluation
LatentVLA is trained and evaluated on NavSim [6], a dataset of over 100,000 frames collected from real-world driving scenarios. NavSim also includes a simulator to evaluate open-loop planning.
In other words, the model predicts a trajectory over the next few seconds given input images. This trajectory is then executed in a BEV simulation operating under the assumption that the actions of the ego-vehicle do not influence the actions of other agents (hence “non-reactive”). This makes it easy to measure planning-related metrics such as the Predictive Driver Model Score (PDMS): a composite metric that quantifies driving safety, progress, and comfort by combining simulation outputs.
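For reference, the PDM score composes hard penalties multiplicatively with a weighted average of soft sub-scores. The sketch below uses the NavSim weighting as I recall it; treat the exact weights as an assumption to be checked against the benchmark:

```python
def pdm_score(nc, dac, ttc, comfort, ep):
    """Approximate PDMS composition (sketch, weights assumed).

    nc:      no at-fault collision (0 or 1, hard penalty)
    dac:     drivable area compliance (0 or 1, hard penalty)
    ttc:     time-to-collision sub-score in [0, 1]
    comfort: comfort sub-score in [0, 1]
    ep:      ego progress sub-score in [0, 1]
    """
    return nc * dac * (5 * ttc + 2 * comfort + 5 * ep) / 12

perfect = pdm_score(1, 1, 1, 1, 1)   # a flawless drive scores 1.0
crashed = pdm_score(0, 1, 1, 1, 1)   # an at-fault collision zeroes the score
```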
However, this kind of evaluation has some important shortcomings, as we’ll discuss later.

On this benchmark, LatentVLA obtains state-of-the-art results, improving upon standard E2E and LLM-based architectures. However, the performance gain from integrating VLM knowledge into iPad and Transfuser seems limited. Focusing on the PDMS, the iPad baseline scores 91.7%. The distilled LatentVLA variant raises the score to 92.1 (+0.4 points) and the non-distilled version reaches 92.4 (a further +0.3).
This small improvement raises the question of whether higher-level reasoning and world knowledge really are essential to driving.
In my opinion, they have the potential to unlock a new level of driving performance, but this is poorly measured by non-interactive planning simulators.

The limitations of open-loop planning
In recent years, it has become widely accepted that evaluating driving models only on open-loop planning gives an incomplete picture of their real driving abilities. Indeed, open-loop planning is fundamentally different from driving, and arguably easier. The main reason is that open-loop planning doesn’t involve interactions with the environment (the simulator is at best non-reactive) and reduces to imitating the trajectory of an expert. This creates multiple problems in real scenarios:
- Small deviations from the learned trajectories lead to cascading errors: without dynamic interaction with the environment and other agents, open-loop models struggle to recover from trajectories that are slightly misaligned with the ones they learned.
- Trajectories are inherently multimodal: for every driving situation, there exist multiple trajectories and acceleration patterns that lead to safe driving outcomes. However, imitation learning on a single expert trajectory collapses this multi-modality, limiting the model’s generalisation capabilities.
For these reasons, it is essential to thoroughly evaluate driving models in closed-loop (i.e. reactive) simulators; this also motivates the use of RL post-training methods, as discussed in the AR1 article.
I’d bet that the gap between LatentVLA and its non-VLM baselines would be larger in these scenarios, as reasoning could help alleviate the limitations of open-loop training.
Conclusion
In this article, we discussed LatentVLA, an approach that integrates VLM knowledge into standard E2E models without relying on natural language. This approach is remarkable in that it learns useful representations from unlabeled data, whereas competing works like AR1 rely on carefully annotated large-scale datasets to sidestep the ambiguity of natural language.
However, LatentVLA would benefit from more thorough evaluation, particularly in closed-loop settings.
Thanks for reading this far!
If you found this article useful, please consider sharing it; it genuinely helps support the time and effort that goes into producing this work. As always, feel free to contact me if you have questions, thoughts, or ideas for follow-ups. If you’d like to support my independent research and writing, feel free to buy me a coffee 😉
Until next time! 👋
