Nvidia took the world of autonomous driving by storm with their recent AlpamayoR1 architecture, which integrates a large Vision-Language Model as a causally-grounded reasoning backbone. This release, accompanied by a brand new large-scale dataset and a photo-realistic driving simulator, already positions the company as one of the major players in the field in 2026.
In this article, we'll break down the AlpamayoR1 architecture, its chain of causation reasoning, and the elaborate training procedure used to train the model.
The Current State of Autonomous Driving
The release of AlpamayoR1 (AR1) is best understood in the context of the current paradigm of End-to-End (E2E) architectures. E2E models aim to map raw sensory inputs (cameras, LiDAR, radar, …) to trajectories in a fully differentiable architecture optimising a unified objective.
An emerging trend in E2E involves leveraging the extensive world knowledge of large Vision-Language Models (VLMs) to tackle complex driving situations. This generally involves using VLMs as reasoning backbones to inform future trajectories, or as expert teachers providing a supervisory signal to smaller student models.
The AR1 Architecture
AR1 is a prime example of the reasoning-VLM-as-a-backbone approach. Despite its massive size, the architecture is optimised for real-world deployment and runs at a latency of 99 ms (roughly 10 Hz) on a single Blackwell GPU, which is considered a general target for safety reasons. In this section, we'll break down the architecture and its numerous innovations.
Vision Encoder
AR1 uses both visual and textual inputs in the form of tokenised camera feeds and natural language instructions. For performance, it's crucial for the vision encoder to produce as few tokens as possible.
To this end, the authors used a Vision Transformer (ViT) [2] for single-image tokenisation. ViTs partition images into patches that form a sequence of tokens encoded by a regular transformer. Note that the integration of more efficient algorithms like Flex [3] for multi-video tokenisation is left for future work.
![Vision Transformer architecture, source: [2]](https://contributor.insightmediagroup.io/wp-content/uploads/2026/02/image-59-1024x572.png)
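To make the token budget concrete, here is a minimal sketch of how a single camera frame could be split into the non-overlapping 16×16 patches a ViT embeds as tokens. The resolution and patch size are illustrative assumptions, not AR1's actual configuration.

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split a batch of images (B, C, H, W) into flattened patches,
    i.e. the token sequence a ViT would embed and feed to its transformer."""
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)

# An illustrative 448x672 frame yields 28 * 42 = 1176 tokens per camera per timestep,
# which is why keeping the token count low matters for multi-camera, multi-frame inputs.
tokens = patchify(torch.randn(1, 3, 448, 672))
print(tokens.shape)  # torch.Size([1, 1176, 768])
```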
Reasoning Backbone
The AR1 architecture is built around Cosmos-Reason, one of Nvidia's VLMs trained specifically for embodied reasoning in Physical AI use cases. Its training set includes 3.7M general Visual Question-Answering (VQA) samples to improve the model's physical common sense, complemented by 24.7K driving samples. These include video VQA annotated with DeepSeek-R1 reasoning traces to predict the next action.
Cosmos-Reason processes visual and text tokens together with the recent ego-history (past x-y positions and angle of the ego-vehicle) to output chain of causation reasoning traces that inform future trajectories.
Chain of Causation
A major limitation of language models lies in the inherent ambiguity of text labels in visual datasets. This includes vague descriptions lacking a causal structure. Models trained on such data exhibit a low correlation between their reasoning traces and predicted actions, as well as causal confusion.

For an embodied agent like an autonomous car, strong causal reasoning abilities are essential. To avoid those problems, the Nvidia team invested significant effort into creating a driving dataset with causally consistent annotations.
Specifically, the dataset comprises 20-second clips extracted from real-world driving recordings in various environments and countries. Each clip contains 2 seconds of context leading to a driving decision (e.g. overtaking, yielding, passing an intersection, …) and its consequences. The causal structure of these scenarios is made explicit through consistent textual annotations following a strict template.

The first 10% of the dataset is annotated by humans, while the rest is annotated by state-of-the-art VLMs like GPT5 to scale the labeling process. Once again, significant effort is invested to ensure the consistency, quality and correctness of these human and AI annotations.

Trajectory Decoder
The last step of the forward pass consists of decoding the reasoning traces into a 64-point trajectory. While trajectories are usually decoded as a sequence of waypoints (x-y coordinates), the Nvidia team found that using unicycle dynamics (i.e. generating a sequence of acceleration values and steering angles) produced more consistent results. Specifically, it facilitates the training task by preventing the model from predicting physically impossible trajectories (e.g. point t being too far from point t+1).
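To see why unicycle dynamics prevent physically impossible outputs, here is a minimal sketch of how a sequence of accelerations and curvatures could be integrated into x-y waypoints. The time step, initial speed and action values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rollout_unicycle(accels, curvatures, v0: float = 5.0, dt: float = 0.1) -> np.ndarray:
    """Integrate a unicycle model: accelerations (m/s^2) and curvatures (1/m)
    are turned into x-y waypoints by updating speed, heading and position."""
    x, y, heading, v = 0.0, 0.0, 0.0, v0
    waypoints = []
    for a, k in zip(accels, curvatures):
        v = max(0.0, v + a * dt)       # speed never becomes negative
        heading += v * k * dt          # curvature = d(heading) / d(arc length)
        x += v * np.cos(heading) * dt
        y += v * np.sin(heading) * dt
        waypoints.append((x, y))
    return np.array(waypoints)

# 64 timesteps of gentle braking while turning left, starting at 5 m/s.
traj = rollout_unicycle(accels=[-0.3] * 64, curvatures=[0.02] * 64)
print(traj.shape)  # (64, 2)
```

Because each waypoint is obtained by integrating speed and heading, consecutive points can never be arbitrarily far apart, unlike freely predicted coordinates.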
Interestingly, the authors adopt a dual representation of the trajectory, where the model auto-regressively generates discrete tokens during training and uses flow-matching to generate a continuous trajectory at inference time. The main reasons behind this design are as follows:
- Joint Motion-Reasoning Token Space: Using discrete motion tokens allows for a tighter coupling between reasoning traces and actions. When the model generates a reasoning trace, the subsequent tokens in the sequence (accelerations and curvatures) are mathematically linked to that explanation, preventing hallucinations.
- Facilitating RL Optimisation: Restricting the possible motion tokens to a discrete set makes RL optimisation significantly easier. Indeed, sampling the correct token from a discrete vocabulary (e.g. ACCEL_NEG_2) is far easier than providing a gradient for a continuous value like -2.145 m/s^2 (see the discretisation sketch after this list). As we'll see in the next section, this enables RL post-training, which is crucial to improving the model's safety and consistency.
- Stronger Supervisory Signal: Using a cross-entropy loss on discrete tokens turns learning into a classification task and better captures the multi-modality of driving decisions (e.g. the distinct probabilities of turning left or right) than an MSE loss on coordinates.
- Flow Matching for Inference: While discrete tokens are great for learning, they typically result in jerky trajectories. Furthermore, generating a sequence of 128 tokens auto-regressively is too slow for real-time inference. To address those limitations, the authors introduce an action expert: a smaller variant of the main architecture that uses the KV cache (which contains visual tokens, historical actions and reasoning traces) to decode a continuous trajectory in a single pass using flow-matching diffusion. This is one of the main reasons why AR1 can run at such low latency.
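To make the discrete-token idea concrete, here is a minimal sketch of how continuous accelerations could be binned into a token vocabulary and decoded back. The bin width, range and token names are my own assumptions for illustration (the paper's ACCEL_NEG_2 is only used as inspiration), not the actual AR1 vocabulary.

```python
import numpy as np

# Hypothetical vocabulary: accelerations binned into 0.5 m/s^2 buckets.
ACCEL_BINS = np.arange(-4.0, 4.01, 0.5)   # -4.0, -3.5, ..., 4.0 m/s^2

def accel_to_token(a: float) -> tuple[str, int]:
    """Map a continuous acceleration to the nearest discrete token (name, index)."""
    idx = int(np.argmin(np.abs(ACCEL_BINS - a)))
    value = ACCEL_BINS[idx]
    sign = "NEG" if value < 0 else "POS"
    return f"ACCEL_{sign}_{abs(value):.1f}", idx

def token_to_accel(idx: int) -> float:
    """Decode a token index back to the bin centre used as the action value."""
    return float(ACCEL_BINS[idx])

name, idx = accel_to_token(-2.145)
print(name, token_to_accel(idx))  # ACCEL_NEG_2.0 -2.0
```

During training, predicting the token index with a cross-entropy loss is an ordinary classification problem, which is exactly what makes both the supervised and RL objectives tractable.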

Supervised Fine-Tuning and RL Post-Training

In order to transform the VLM backbone into a performant driving policy, it undergoes supervised fine-tuning (SFT) on the chain of causation dataset. Specifically, it learns to reproduce the reasoning traces and associated ground-truth actions by maximising the log-likelihood of the action-reasoning sequence.
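The paper's exact notation isn't reproduced here, but a standard formulation of this objective, in my own notation, looks like the following, where $o$ denotes the observations (visual tokens, instructions, ego-history), $c$ the reasoning trace and $a$ the discrete action tokens:

$$
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(o,\,c,\,a)\sim\mathcal{D}}\left[\sum_{t=1}^{T}\log \pi_\theta\big(y_t \mid y_{<t},\, o\big)\right],\qquad y = (c, a)
$$

In other words, the reasoning trace and the action tokens are treated as one sequence, so the same next-token prediction loss supervises both.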
However, SFT on its own is not enough. VLMs are notoriously affected by discrepancies between their reasoning and predicted actions. The static nature of open-loop datasets allows the model to mimic reasoning traces, but the lack of environmental feedback prevents it from truly internalising causal relations.
Fortunately, RL post-training helps alleviate those limitations by providing feedback on the model's rollouts. In this paper, the authors use RL for three main purposes:
- Improving reasoning quality: a large reasoning model (e.g. DeepSeek-R1) evaluates AR1's reasoning traces to ensure there are no inconsistencies or hallucinations, and assigns a discrete reward on a scale of 0 to 5 accordingly. While DeepSeek-R1 is not expected to be able to generate high-quality reasoning traces for driving itself, evaluating AR1's reasoning is a significantly easier task, exploiting the well-known gap between verification and generation.
- Enforcing reasoning-action consistency: the authors extract meta-actions (accelerate, steer, go straight, …) from the CoC dataset using rule-based systems. If those meta-actions correspond to the ones mentioned in the reasoning traces, the model receives an additional reward of 1, otherwise 0.
- Trajectory quality: a trajectory reward measures the L2 distance between the predicted and expert trajectories and penalises trajectories leading to collisions or high-magnitude jerk (a toy combination of these signals is sketched below).
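Putting the three signals together, a toy reward function could look like the sketch below. The weights, the 0-5 judge-score mapping and the collision/jerk penalties are my own illustrative choices, not the paper's actual reward shaping.

```python
def total_reward(reasoning_score: float, meta_action_match: bool,
                 l2_dist: float, collided: bool, jerk: float,
                 w_reason: float = 1.0, w_consistency: float = 1.0,
                 w_traj: float = 1.0) -> float:
    """Combine the three post-training signals into one scalar reward per rollout."""
    r_reason = reasoning_score / 5.0                   # judge score in [0, 5] -> [0, 1]
    r_consistency = 1.0 if meta_action_match else 0.0  # reasoning mentions the executed meta-action
    r_traj = -l2_dist - (5.0 if collided else 0.0) - 0.1 * abs(jerk)
    return w_reason * r_reason + w_consistency * r_consistency + w_traj * r_traj

# A clean rollout: consistent reasoning (4/5), matching meta-action, 0.3 m error, no collision.
print(total_reward(4.0, True, 0.3, False, 1.2))  # 1.38
```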
During post-training, AR1 generates multiple parallel rollouts and collects rewards r_i based on the three reward signals above. These rewards are then used to compute the GRPO loss [4]. GRPO computes the advantage of each rollout relative to the group average. This critic-free approach (as opposed to actor-critic algorithms like PPO) stabilises training by rewarding reasoning paths that outperform their counterparts for the same input, rather than relying on an arbitrary absolute score.
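The paper's exact loss isn't reproduced here, but a simplified GRPO-style objective, omitting the clipped importance-sampling ratios of the full algorithm, reads roughly as follows, with $G$ rollouts $y_1, \dots, y_G$ sampled for the same input $o$, combined rewards $r_i$, and $\pi_{\mathrm{ref}}$ the SFT policy:

$$
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} A_i\, \log \pi_\theta\big(y_i \mid o\big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big), \qquad A_i = \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})}
$$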
All you need to understand about this objective is that it aims to maximise the probability of trajectories (the log term) that have a high advantage (the group-relative term) compared to others. To avoid losing the vision-language priors of the VLM and the driving knowledge obtained during SFT, the objective is regularised by a KL divergence between the current policy and the reference (the policy obtained at the end of SFT).
Evaluation
The evaluation protocol includes four parts: open-loop trajectory prediction, closed-loop simulation, ablation studies, and on-vehicle road tests. While the fact that AR1 was deployed in real-world scenarios is impressive, the open and closed-loop results are somewhat opaque; the main reason being that they were obtained on Nvidia datasets (open-loop: the PhysicalAI-AV dataset, closed-loop: AlpaSim) released at the same time as the model. This implies a lack of baselines to contextualise AR1's performance.
For instance, the closed-loop results only feature AR1 and a non-reasoning baseline on 75 scenarios. While AR1 outperforms the baseline on all measured metrics, it often does so by a single percentage point on average, and with a much larger variance than the baseline.

For this reason, I would advise taking these results with a grain of salt until other frontier architectures are evaluated in AlpaSim.
Conclusion
Despite the lack of contextualised results, AR1 and the accompanying datasets remain an impressive engineering achievement and a good indication of where autonomous driving is headed: end-to-end models inheriting world knowledge from massive VLMs trained on embodied tasks.
However, the collection of causally-grounded datasets required to enable chain of causation reasoning demands significant investment and labeling effort, which limits reproducibility. In my next article, I'll contrast the AR1 approach with another state-of-the-art model which entirely dispenses with textual labels and instead trains VLMs to act and reason in a latent space.
Thanks for reading this far!
If you found this article useful, please consider sharing it; it genuinely helps support the time and effort that goes into producing this work. As always, feel free to contact me if you have questions, thoughts, or ideas for follow-ups. If you'd like to support my independent research and writing, feel free to buy me a coffee 😉
Until next time! 👋
