Dataset Recording, VLA Fine-Tuning, and On-Device Optimizations

Gaetan Bahl · pilati-nxp

Recent advances in Large Language Models have enabled the transition from text-only reasoning to multimodal systems: first with the integration of visual perception in Vision–Language Models (VLMs), and more recently with the generation of robot actions in Vision–Language–Action (VLA) models. Deploying these models on embedded robotic platforms remains a challenge due to tight constraints on compute, memory, and power, in addition to real-time control requirements.

In synchronous control pipelines, the arm is idle awaiting commands while the VLA is running inference, resulting in oscillatory behavior and delayed corrections. To tackle that, asynchronous inference can enable smooth and continuous motion by dissociating generation from execution. However, to be effective, the end-to-end inference latency must remain shorter than the motion execution duration. This temporal constraint therefore sets an upper limit on the model's inference latency.

Bringing VLA models to embedded platforms is not merely a matter of model compression, but a complex systems engineering problem requiring architectural decomposition, latency-aware scheduling, and hardware-aligned execution. Addressing these challenges is essential to translate recent advances in multimodal foundation models into practical and deployable embedded robotic systems.

This guide presents NXP's hands-on best practices for recording reliable robotic datasets and fine-tuning VLA policies (ACT and SmolVLA), and highlights the real-time performance that the NXP i.MX95 achieves after optimization.




🎥 Dataset Recording: What Actually Matters

High‑quality, consistent data beats “more but messy” data. This section turns hard‑earned lessons into concrete checklists and schemas.

In our case, we recorded a dataset for the task: “Put the tea bag in the mug.”



1) Consistency First

  • Fixed cameras: Use rigid mounts to avoid pose drift. If one or more cameras shift during recording or evaluation, due to the robot's vibrations or the operator resetting the environment, you can observe a severe accuracy loss.
  • Controlled lighting: Set up your environment where you can have as much control as possible over lighting (fixed light sources, away from sunlight, which varies during the day).
  • Strong contrast: Avoid training with “white on white” unless that's your deployment domain. Maximize contrast between the arm, the object, and the environment.
  • Fixed calibration: Make sure to have backups of your robot and teleoperator calibrations so that you do not have to re-record your previous episodes if the code crashes.
  • Don't cheat: Don't use information the model will not have access to at inference time. During data recording, it is tempting for the operator to rely on direct visual observation of the scene. However, this introduces information that is absent from the dataset. Dataset collection must be restricted to the same camera inputs that will be available to the policy at runtime.
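The "don't cheat" rule can be enforced mechanically at recording time by rejecting any frame whose observation keys differ from what the policy will consume. A minimal sketch — the key names below are illustrative placeholders, not a specific framework's schema:

```python
# Guard against recording information the policy will not see at runtime:
# every recorded frame must expose exactly the inputs the policy consumes.
POLICY_INPUTS = {
    "observation.images.top",
    "observation.images.gripper",
    "observation.images.left",
    "observation.state",
}

def validate_frame(frame: dict) -> None:
    """Raise if a recorded frame and the policy's inputs diverge."""
    missing = POLICY_INPUTS - frame.keys()
    extra = frame.keys() - POLICY_INPUTS
    if missing or extra:
        raise ValueError(f"observation mismatch: missing={missing}, extra={extra}")
```

Running this check on every episode during collection catches a drifted camera name or an accidentally recorded operator-side view before it silently contaminates the dataset.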



2) Use a Gripper Camera (Highly Advisable)

Moving from scene-only views to mixed viewpoints increases global accuracy, but each additional camera increases latency. Therefore, you must choose the right compromise. In our case that balance was reached with 3 cameras:

| Top | Gripper | Left |
|---|---|---|
| The global view of the whole scene. | The closest view for precise grasps and alignment. | Complements the top view for height and depth. |

We strongly recommend using a gripper-mounted camera. It consistently improves success rates on fine manipulation tasks by providing a detailed, task-relevant viewpoint. Importantly, it is also the camera that most effectively enforces correct data collection practices, allowing the operator to rely exclusively on the robot's perception rather than observing the scene directly.

When installing a gripper camera, we recommend securing the cable with Velcro or a strain-relief guide to prevent it from obstructing the field of view or becoming disconnected during motion.



3) Improve Prehension

[Image: heat-shrink tubing on the gripper claws]

Simple hardware tweaks like heat-shrink tubing over the gripper claws increase friction, reduce roughness, reduce slippage during episodes, and increase the task success rate (fewer “almost success” episodes), improving policy learning stability.



4) Diversity & Splits

[Image: starting-position clusters across the workspace]

When recording a dataset, you must:

  • Vary episode distribution: Divide your workspace into starting-position clusters, and record at least 10 episodes per cluster. Add diversity by changing the object's position and rotation.

    We partitioned the robot arm's reachable workspace into 11 clusters, each measuring 10 × 10 cm.

  • Differentiate training & validation sets: Policies can easily overfit on the training set, so make sure that the validation set is unseen by the model.

    We removed cluster 6 from the training set.

  • Record as many movements as you can: Small VLA models exhibit limited generalization to unseen motion. Therefore, record episodes that cover broad ranges of the degrees of freedom.

    We grasped the tea bag in either a horizontal or vertical position.

  • Anticipate failure: Sometimes the policy will not reach the object on the first attempt and will have to “return to it”. We noticed that having 20% of all episodes correspond to this going-back-to-the-object case helps the model improve the overall success rate.

    Around 20% of our training set corresponds to recovery episodes.

This mirrors best practices across VLA papers and community guides. Here are 3 examples of data diversity within the same cluster:

| Starting position 1 | Starting position 2 | Recovery episode |
|---|---|---|
| cluster_10_1 | cluster_10_2 | recovery |

Starting positions 1 and 2 correspond to different positions within the same cluster. In contrast, during the recovery episode, the robot does not begin in “starting mode” but is instead already near the mug and can proceed directly to retrieve the tea bag from that location.
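The cluster-based split described above can be expressed directly in code. A minimal sketch, with a placeholder episode-to-cluster mapping standing in for the real dataset index:

```python
# Hold out one full starting-position cluster (n°6) so that validation
# episodes come from a spatially unseen region of the workspace.
# The episode-to-cluster assignment here is illustrative.
episodes = {ep_id: ep_id % 11 for ep_id in range(132)}  # 11 clusters, 12 eps each

HELD_OUT_CLUSTER = 6
train_ids = [ep for ep, cluster in episodes.items() if cluster != HELD_OUT_CLUSTER]
val_ids = [ep for ep, cluster in episodes.items() if cluster == HELD_OUT_CLUSTER]
```

Removing a whole cluster, rather than sampling random episodes, is what makes the validation score measure spatial generalization instead of memorization.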




🎛️ Fine-Tuning VLAs

[Image: ACT training and validation loss curves]

What we did in practice:

  • Task: “Grab the tea bag and place it in the mug.”
  • Dataset:
    • 120 episodes: 10 clusters x (10 different tea bag starting positions + 2 recovery episodes)
    • 3 cameras (640x480px, 30fps): Top, Gripper, Left
    • Cluster n°6 was removed for validation
  • Batch size: 8
  • Training: The model checkpoint with the lowest validation loss after 200k steps was chosen

For ACT (100 actions per chunk), the range providing the best trade-off between accuracy, generalization, and motion smoothness across both the training and validation sets was 100k–160k training steps.
For SmolVLA training (50 actions per chunk), the trade-off appears after many more training steps. We found that continuing training slightly past the point where the model begins to overfit tends to improve overall accuracy.

Rule of thumb: select the final checkpoint by evaluating success on both the training and validation sets, not by training loss.
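Once rollout success rates are measured per checkpoint, that rule of thumb can be applied mechanically. A sketch with placeholder numbers (the success rates and the weighting scheme are illustrative assumptions, not our measured values):

```python
# Success rates from evaluation rollouts (not training loss), per checkpoint.
evals = {
    100_000: {"train": 0.95, "val": 0.70},
    140_000: {"train": 1.00, "val": 0.90},
    200_000: {"train": 1.00, "val": 0.60},  # overfit: train-only success
}

def pick_checkpoint(evals: dict, val_weight: float = 0.5) -> int:
    """Pick the step with the best weighted success; ties go to fewer steps."""
    def score(step: int) -> float:
        e = evals[step]
        return (1 - val_weight) * e["train"] + val_weight * e["val"]
    return min(evals, key=lambda s: (-score(s), s))
```

Note how the 200k checkpoint, despite perfect training success, loses to 140k once validation success is weighted in.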




⚡ Optimizing for NXP i.MX95

The i.MX95 integrates 6× Arm Cortex-A55 cores, Cortex-M7/M33 cores, a Mali GPU, a new ISP, and the eIQ® Neutron NPU, targeting efficient, secure edge inference with multi-camera support and strong I/O. [nxp.com]



1) Divide And Conquer

Instead of running the model as one monolithic graph, we decompose the VLA graph into logical stages: encoders, decoders, and action experts. This allows each component to be optimized, scheduled, and deployed independently.

In practice, SmolVLA is partitioned into the following sub-blocks:

  • Vision: processes RGB camera frames and produces visual embeddings.
  • LLM backbone: generates action tokens from visual and textual embeddings.
  • Action expert: applies flow matching to iteratively denoise action samples and outputs the final control commands.

This separation allows per-block optimizations. The impact of quantizing each block can be measured to choose the best tradeoff between latency and accuracy. Also, isolating the action expert from the VLM made it possible to run it at a lower frequency.
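Under that decomposition, one SmolVLA step reduces to chaining three independently deployed sub-graphs. A sketch in which the three callables stand in for the separately compiled vision, backbone, and action-expert artifacts (the function names are illustrative):

```python
def smolvla_step(frames, text_tokens, encode_images, run_backbone,
                 denoise_actions, num_denoise_steps: int = 10):
    """One inference step over the three decomposed sub-blocks."""
    # Each stage is a separate artifact, so it can be quantized,
    # scheduled, and profiled on its own.
    visual_embeddings = encode_images(frames)                 # vision block
    context = run_backbone(visual_embeddings, text_tokens)    # LLM backbone
    return denoise_actions(context, steps=num_denoise_steps)  # action expert
```

Because the action expert is the last, isolated stage, it can be re-invoked at a lower frequency than the vision and backbone stages without touching the rest of the pipeline.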



2) Quantization

To optimize inference for the i.MX95, we explored several quantization techniques on the different blocks. We found that quantizing the vision encoder and LLM prefill had limited impact on accuracy, whereas quantizing the denoising flow in the action expert significantly degrades performance.
This behaviour is expected, as quantization errors accumulate across iterative denoising steps.

That is why we decided to keep this block at higher precision to preserve stability, while for the other blocks we explored various quantization configurations, from 8-bit mixed precision to 4-bit quantization, depending on the layers.
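The intuition can be made concrete with a back-of-the-envelope model: if each denoising step contributes an independent, zero-mean quantization error, the errors add in quadrature, so the iterative flow amplifies noise that a single-pass block like the vision encoder would not.

```python
import math

def accumulated_error(per_step_error: float, steps: int) -> float:
    # Independent per-step errors grow as sqrt(steps): a 10-step
    # denoising flow sees ~3.2x the error of a single forward pass.
    return per_step_error * math.sqrt(steps)
```

This is a simplified model (real quantization error is neither perfectly independent nor zero-mean), but it captures why the same bit-width that is harmless in the prefill degrades the action expert.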

In addition, we applied in-house optimizations to the various blocks. Results are shown in the table below, referred to as optimized models.



3) Asynchronous Inference: Control-Aware Scheduling

In a synchronous control loop, the pipeline operates as:

  1. Capture observation
  2. Run full model inference
  3. Execute generated action

During step (2), the robot remains idle. If inference latency is non-negligible, this produces:

  • Idle gaps in motion
  • Oscillatory corrections due to stale observations
  • Reduced effective control frequency
  • Poor recovery behavior

With asynchronous inference, action generation runs in parallel with execution:

  • The robot executes the current action chunk
  • The next chunk is computed concurrently

This increases the effective control frequency, reduces observation staleness, and improves recovery behavior.

On embedded platforms such as the i.MX95, asynchronous inference is essential, but only effective if inference latency is kept under the action horizon budget: $T_{\text{inference}} < T_{\text{execution}}$
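The scheduling pattern itself is small. A minimal sketch, with `infer`, `execute`, and `get_obs` as stand-ins for the real policy, robot interface, and camera stack:

```python
import queue
import threading

def async_control_loop(infer, execute, get_obs, n_chunks: int = 3):
    """Overlap chunk generation with chunk execution."""
    # The producer computes the next action chunk from the freshest
    # observation while the consumer (robot) executes the current one.
    chunks: queue.Queue = queue.Queue(maxsize=1)

    def producer():
        for _ in range(n_chunks):
            # Each infer() must finish before the chunk being
            # executed runs out (T_inference < T_execution).
            chunks.put(infer(get_obs()))

    worker = threading.Thread(target=producer)
    worker.start()
    executed = [execute(chunks.get()) for _ in range(n_chunks)]
    worker.join()
    return executed
```

The bounded queue (`maxsize=1`) keeps the producer from racing ahead of the robot, so each chunk is generated from a recent observation rather than a stale backlog.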

|  | Synchronous inference | Asynchronous inference |
|---|---|---|
| Actions per chunk | 100 | 100 |
| FPS | 60 | 60 |
| Chunk size threshold | N/A | 0.2 |
| Aggregate function | N/A | weighted_average |
| Action queue evolution | async_g_0 | async_g_02 |
| Results |  |  |
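A `weighted_average` aggregate merges overlapping predictions when a new chunk arrives before the old one is exhausted. A minimal sketch of the idea — the fixed blending weight below is an illustrative assumption, not the exact aggregation used in our runs:

```python
def aggregate_chunks(old_tail, new_head, newness_weight: float = 0.7):
    """Blend overlapping action predictions, trusting newer ones more."""
    # old_tail: remaining actions of the chunk currently executing.
    # new_head: actions of the fresh chunk covering the same timesteps.
    assert len(old_tail) == len(new_head)
    return [
        (1 - newness_weight) * old + newness_weight * new
        for old, new in zip(old_tail, new_head)
    ]
```

Blending, rather than hard-switching to the new chunk, avoids the discontinuous jumps in commanded position that cause the oscillatory behavior described above.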



📊 What We Achieve on i.MX95

[Image: i.MX95 evaluation setup]

Setup

  • Task: “Grab the tea bag and place it in the mug.”
  • Test set (20 episodes): 2 random positions for each cluster.
  • Validation set (10 episodes): all 10 positions in cluster n°6
| Platform (CPU) | Policy | Format | Inference Latency | Accuracy Test Set (20) | Accuracy Validation Set (10) | Global Accuracy (30) |
|---|---|---|---|---|---|---|
| i.MX 95 | ACT | ONNX FP32 | 2.86 s | 1.00 | 0.90 | 0.96 |
| i.MX 95 | ACT | Optimized | 0.32 s | 1.00 | 0.60 | 0.89 |
| i.MX 95 | SmolVLA | ONNX FP32 | 29.1 s | 0.50 | 0.40 | 0.47 |



⏩ Next Steps

Our immediate objective is to improve task accuracy with SmolVLA (ONNX FP32). We have already established a baseline and measured an optimized on-board inference latency of 6.15 s.

The next phase will focus on deeper optimizations on our NPUs. In parallel, we aim to move from the single-task setup toward longer-horizon and more complex scenarios. To do this, we will introduce:

  • Simulation environments for scalable data generation and benchmarking
  • Reinforcement Learning (RL) for policy refinement
  • Sim-to-Real transfer to bridge domain gaps and improve real-world performance

The goal is to move from a single validated manipulation task toward a reproducible methodology for deploying VLA policies on embedded robotic systems.




✅ Checklists You Can Reuse

Recording

  • Rigid camera mounts, controlled lighting, strong contrast
  • Back up robot and teleoperator calibrations
  • Gripper camera installed, cable strain-relieved
  • At least 10 episodes per starting-position cluster, ~20% recovery episodes
  • Operator relies only on the policy's camera views

Training

  • Hold out at least one full cluster for validation
  • Select the checkpoint by success on both training and validation sets, not by training loss

Deployment on i.MX95

  • Decompose the graph into vision, backbone, and action-expert blocks
  • Quantize tolerant blocks; keep the denoising flow at higher precision
  • Run inference asynchronously and keep inference latency under the execution horizon




📚 Resources & Inspiration


