Dataset Recording, VLA Fine-Tuning, and On-Device Optimizations

Gaetan Bahl · pilati-nxp

Recent advances in Large Language Models have enabled the transition from text-only reasoning to multimodal systems: first with the integration of visual perception in Vision–Language Models (VLMs), and more recently with the generation of robot actions in Vision–Language–Action (VLA) models. Deploying these models on embedded robotic platforms remains a challenge due to tight constraints on compute, memory, and power, in addition to real-time control requirements.

In synchronous control pipelines, the arm is idle awaiting commands while the VLA is running inference, resulting in oscillatory behavior and delayed corrections. To tackle that, asynchronous inference can enable smooth and continuous motion by dissociating generation from execution. However, to be effective, the end-to-end inference latency must remain shorter than the motion execution duration. This temporal constraint therefore sets an upper limit on the model's inference latency.

Bringing VLA models to embedded platforms is not merely a matter of model compression, but a complex systems engineering problem requiring architectural decomposition, latency-aware scheduling, and hardware-aligned execution. Addressing these challenges is essential to translate recent advances in multimodal foundation models into practical and deployable embedded robotic systems.

This guide presents NXP's hands-on best practices for recording reliable robotic datasets and fine-tuning VLA policies (ACT and SmolVLA), and highlights the real-time performance that the NXP i.MX95 achieves after optimization.




🎥 Dataset Recording: What Actually Matters

High‑quality, consistent data beats “more but messy” data. This section turns hard‑earned lessons into concrete checklists and schemas.

In our case, we recorded a dataset for the task: “Put the tea bag in the mug.”



1) Consistency First

  • Fixed cameras: Use rigid mounts to avoid pose drift. If one or more cameras shift during recording or evaluation, due to the robot's vibrations or the operator resetting the environment, you can observe a severe accuracy loss.
  • Controlled lighting: Set up your environment where you can have as much control as possible over lighting (fixed light sources, away from sunlight, which varies during the day).
  • Strong contrast: Avoid training with “white on white” unless that's your deployment domain. Maximize contrast between the arm, the object, and the environment.
  • Fixed calibration: Make sure to have backups of your robot and teleoperator calibrations so that you do not have to re-record your previous episodes if the code crashes.
  • Don't cheat: Don't use information the model will not have access to at inference time. During data recording, it is tempting for the operator to rely on direct visual observation of the scene. However, this introduces information that is absent from the dataset. Dataset collection must be restricted to the same camera inputs that will be available to the policy at runtime.
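The "don't cheat" rule can be enforced mechanically at recording time by rejecting any frame whose observation keys differ from what the policy will consume. A minimal sketch — the key names below are illustrative placeholders, not a specific framework's schema:

```python
# Guard against recording information the policy will not see at runtime:
# every recorded frame must expose exactly the inputs the policy consumes.
POLICY_INPUTS = {
    "observation.images.top",
    "observation.images.gripper",
    "observation.images.left",
    "observation.state",
}

def validate_frame(frame: dict) -> None:
    """Raise if a recorded frame and the policy's inputs diverge."""
    missing = POLICY_INPUTS - frame.keys()
    extra = frame.keys() - POLICY_INPUTS
    if missing or extra:
        raise ValueError(f"observation mismatch: missing={missing}, extra={extra}")
```

Running this check on every episode during collection catches a drifted camera name or an accidentally recorded operator-side view before it silently contaminates the dataset.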



2) Use a Gripper Camera (Highly Advisable)

Moving from scene-only views to mixed viewpoints increases global accuracy, but each additional camera increases latency. Therefore, you must choose the right compromise. In our case that balance was reached with 3 cameras:

| Top | Gripper | Left |
|---|---|---|
| The global view of the whole scene. | The closest view for precise grasps and alignment. | Complements the top view for height and depth. |

We strongly recommend using a gripper-mounted camera. It consistently improves success rates on fine manipulation tasks by providing a detailed, task-relevant viewpoint. Importantly, it is also the camera that most effectively enforces correct data collection practices, allowing the operator to rely exclusively on the robot's perception rather than observing the scene directly.

When installing a gripper camera, we recommend securing the cable with Velcro or a strain-relief guide to prevent it from obstructing the field of view or becoming disconnected during motion.



3) Improve Prehension

[Image: heat-shrink tubing on the gripper claws]

Simple hardware tweaks like heat-shrink tubing over the gripper claws increase friction, reduce roughness, reduce slippage during episodes, and increase the task success rate (fewer “almost success” episodes), improving policy learning stability.



4) Diversity & Splits

[Image: starting-position clusters across the workspace]

When recording a dataset, you must:

  • Vary episode distribution: Divide your workspace into starting-position clusters, and record at least 10 episodes per cluster. Add diversity by changing the object's position and rotation.

    We partitioned the robot arm's reachable workspace into 11 clusters, each measuring 10 × 10 cm.

  • Differentiate training & validation sets: Policies can easily overfit on the training set, so make sure that the validation set is unseen by the model.

    We removed cluster 6 from the training set.

  • Record as many movements as you can: Small VLA models exhibit limited generalization to unseen motion. Therefore, record episodes that cover broad ranges of the degrees of freedom.

    We grasped the tea bag in either a horizontal or vertical position.

  • Anticipate failure: Sometimes the policy will not reach the object on the first attempt and will have to “return to it”. We noticed that having 20% of all episodes correspond to this going-back-to-the-object case helps the model improve the overall success rate.

    Around 20% of our training set corresponds to recovery episodes.

This mirrors best practices across VLA papers and community guides. Here are 3 examples of data diversity within the same cluster:

| Starting position 1 | Starting position 2 | Recovery episode |
|---|---|---|
| cluster_10_1 | cluster_10_2 | recovery |

Starting positions 1 and 2 correspond to different positions within the same cluster. In contrast, during the recovery episode, the robot does not begin in “starting mode” but is instead already near the mug and can proceed directly to retrieve the tea bag from that location.
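The cluster-based split described above can be expressed directly in code. A minimal sketch, with a placeholder episode-to-cluster mapping standing in for the real dataset index:

```python
# Hold out one full starting-position cluster (n°6) so that validation
# episodes come from a spatially unseen region of the workspace.
# The episode-to-cluster assignment here is illustrative.
episodes = {ep_id: ep_id % 11 for ep_id in range(132)}  # 11 clusters, 12 eps each

HELD_OUT_CLUSTER = 6
train_ids = [ep for ep, cluster in episodes.items() if cluster != HELD_OUT_CLUSTER]
val_ids = [ep for ep, cluster in episodes.items() if cluster == HELD_OUT_CLUSTER]
```

Removing a whole cluster, rather than sampling random episodes, is what makes the validation score measure spatial generalization instead of memorization.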




🎛️ Fine-Tuning VLAs

[Image: ACT training and validation loss curves]

What we did in practice:

  • Task: “Grab the tea bag and place it in the mug.”
  • Dataset:
    • 120 episodes: 10 clusters x (10 different tea bag starting positions + 2 recovery episodes)
    • 3 cameras (640x480px, 30fps): Top, Gripper, Left
    • Cluster n°6 was removed for validation
  • Batch size: 8
  • Training: The model checkpoint with the lowest validation loss after 200k steps was chosen

For ACT (100 actions per chunk), the range providing the best trade-off between accuracy, generalization, and motion smoothness across both the training and validation sets was 100k–160k training steps.
For SmolVLA training (50 actions per chunk), the trade-off appears after many more training steps. We found that continuing training slightly past the point where the model begins to overfit tends to improve overall accuracy.

Rule of thumb: select the final checkpoint by evaluating success on both the training and validation sets, not by training loss.
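Once rollout success rates are measured per checkpoint, that rule of thumb can be applied mechanically. A sketch with placeholder numbers (the success rates and the weighting scheme are illustrative assumptions, not our measured values):

```python
# Success rates from evaluation rollouts (not training loss), per checkpoint.
evals = {
    100_000: {"train": 0.95, "val": 0.70},
    140_000: {"train": 1.00, "val": 0.90},
    200_000: {"train": 1.00, "val": 0.60},  # overfit: train-only success
}

def pick_checkpoint(evals: dict, val_weight: float = 0.5) -> int:
    """Pick the step with the best weighted success; ties go to fewer steps."""
    def score(step: int) -> float:
        e = evals[step]
        return (1 - val_weight) * e["train"] + val_weight * e["val"]
    return min(evals, key=lambda s: (-score(s), s))
```

Note how the 200k checkpoint, despite perfect training success, loses to 140k once validation success is weighted in.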




⚡ Optimizing for NXP i.MX95

The i.MX95 integrates 6× Arm Cortex-A55 cores, Cortex-M7/M33 cores, a Mali GPU, a new ISP, and the eIQ® Neutron NPU, targeting efficient, secure edge inference with multi-camera support and strong I/O. [nxp.com]



1) Divide And Conquer

Instead of running the model as one monolithic graph, we decompose the VLA graph into logical stages: encoders, decoders, and action experts. This allows each component to be optimized, scheduled, and deployed independently.

In practice, SmolVLA is partitioned into the following sub-blocks:

  • Vision: processes RGB camera frames and produces visual embeddings.
  • LLM backbone: generates action tokens from visual and textual embeddings.
  • Action expert: applies flow matching to iteratively denoise action samples and outputs the final control commands.

This separation allows per-block optimizations. The impact of quantizing each block can be measured to choose the best tradeoff between latency and accuracy. Also, isolating the action expert from the VLM made it possible to run it at a lower frequency.
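Under that decomposition, one SmolVLA step reduces to chaining three independently deployed sub-graphs. A sketch in which the three callables stand in for the separately compiled vision, backbone, and action-expert artifacts (the function names are illustrative):

```python
def smolvla_step(frames, text_tokens, encode_images, run_backbone,
                 denoise_actions, num_denoise_steps: int = 10):
    """One inference step over the three decomposed sub-blocks."""
    # Each stage is a separate artifact, so it can be quantized,
    # scheduled, and profiled on its own.
    visual_embeddings = encode_images(frames)                 # vision block
    context = run_backbone(visual_embeddings, text_tokens)    # LLM backbone
    return denoise_actions(context, steps=num_denoise_steps)  # action expert
```

Because the action expert is the last, isolated stage, it can be re-invoked at a lower frequency than the vision and backbone stages without touching the rest of the pipeline.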



2) Quantization

To optimize inference for the i.MX95, we explored several quantization techniques on the different blocks. We found that quantizing the vision encoder and LLM prefill had limited impact on accuracy, whereas quantizing the denoising flow in the action expert significantly degrades performance.
This behaviour is expected, as quantization errors accumulate across iterative denoising steps.

That is why we decided to keep this block at higher precision to preserve stability, while for the other blocks we explored various quantization configurations, from 8-bit mixed precision to 4-bit quantization, depending on the layers.
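The intuition can be made concrete with a back-of-the-envelope model: if each denoising step contributes an independent, zero-mean quantization error, the errors add in quadrature, so the iterative flow amplifies noise that a single-pass block like the vision encoder would not.

```python
import math

def accumulated_error(per_step_error: float, steps: int) -> float:
    # Independent per-step errors grow as sqrt(steps): a 10-step
    # denoising flow sees ~3.2x the error of a single forward pass.
    return per_step_error * math.sqrt(steps)
```

This is a simplified model (real quantization error is neither perfectly independent nor zero-mean), but it captures why the same bit-width that is harmless in the prefill degrades the action expert.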

In addition, we applied in-house optimizations to the various blocks. Results are shown in the table below, referred to as optimized models.



3) Asynchronous Inference: Control-Aware Scheduling

In a synchronous control loop, the pipeline operates as:

  1. Capture observation
  2. Run full model inference
  3. Execute generated action

During step (2), the robot remains idle. If inference latency is non-negligible, this produces:

  • Idle gaps in motion
  • Oscillatory corrections due to stale observations
  • Reduced effective control frequency
  • Poor recovery behavior

With asynchronous inference, action generation runs in parallel with execution:

  • The robot executes the current action chunk
  • The next chunk is computed concurrently

This increases the effective control frequency, reduces observation staleness, and improves recovery behavior.

On embedded platforms such as the i.MX95, asynchronous inference is essential, but only effective if inference latency is kept under the action horizon budget: $T_{\text{inference}} < T_{\text{execution}}$
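The scheduling pattern itself is small. A minimal sketch, with `infer`, `execute`, and `get_obs` as stand-ins for the real policy, robot interface, and camera stack:

```python
import queue
import threading

def async_control_loop(infer, execute, get_obs, n_chunks: int = 3):
    """Overlap chunk generation with chunk execution."""
    # The producer computes the next action chunk from the freshest
    # observation while the consumer (robot) executes the current one.
    chunks: queue.Queue = queue.Queue(maxsize=1)

    def producer():
        for _ in range(n_chunks):
            # Each infer() must finish before the chunk being
            # executed runs out (T_inference < T_execution).
            chunks.put(infer(get_obs()))

    worker = threading.Thread(target=producer)
    worker.start()
    executed = [execute(chunks.get()) for _ in range(n_chunks)]
    worker.join()
    return executed
```

The bounded queue (`maxsize=1`) keeps the producer from racing ahead of the robot, so each chunk is generated from a recent observation rather than a stale backlog.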

|  | Synchronous inference | Asynchronous inference |
|---|---|---|
| Actions per chunk | 100 | 100 |
| FPS | 60 | 60 |
| Chunk size threshold | N/A | 0.2 |
| Aggregate function | N/A | weighted_average |
| Action queue evolution | async_g_0 | async_g_02 |
| Results |  |  |
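A `weighted_average` aggregate merges overlapping predictions when a new chunk arrives before the old one is exhausted. A minimal sketch of the idea — the fixed blending weight below is an illustrative assumption, not the exact aggregation used in our runs:

```python
def aggregate_chunks(old_tail, new_head, newness_weight: float = 0.7):
    """Blend overlapping action predictions, trusting newer ones more."""
    # old_tail: remaining actions of the chunk currently executing.
    # new_head: actions of the fresh chunk covering the same timesteps.
    assert len(old_tail) == len(new_head)
    return [
        (1 - newness_weight) * old + newness_weight * new
        for old, new in zip(old_tail, new_head)
    ]
```

Blending, rather than hard-switching to the new chunk, avoids the discontinuous jumps in commanded position that cause the oscillatory behavior described above.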



📊 What We Achieve on i.MX95

[Image: i.MX95 evaluation setup]

Setup

  • Task: “Grab the tea bag and place it in the mug.”
  • Test set (20 episodes): 2 random positions for each cluster.
  • Validation set (10 episodes): all 10 positions in cluster n°6
| Platform (CPU) | Policy | Format | Inference Latency | Accuracy Test Set (20) | Accuracy Validation Set (10) | Global Accuracy (30) |
|---|---|---|---|---|---|---|
| i.MX 95 | ACT | ONNX FP32 | 2.86 s | 1.00 | 0.90 | 0.96 |
| i.MX 95 | ACT | Optimized | 0.32 s | 1.00 | 0.60 | 0.89 |
| i.MX 95 | SmolVLA | ONNX FP32 | 29.1 s | 0.50 | 0.40 | 0.47 |



⏩ Next Steps

Our immediate objective is to improve task accuracy with SmolVLA (ONNX FP32). We have already established a baseline and measured an optimized on-board inference latency of 6.15 s.

The next phase will focus on deeper optimizations on our NPUs. In parallel, we aim to move from the single-task setup toward longer-horizon and more complex scenarios. To do this, we will introduce:

  • Simulation environments for scalable data generation and benchmarking
  • Reinforcement Learning (RL) for policy refinement
  • Sim-to-Real transfer to bridge domain gaps and improve real-world performance

The goal is to move from a single validated manipulation task toward a reproducible methodology for deploying VLA policies on embedded robotic systems.




✅ Checklists You Can Reuse

Recording

  • Rigid camera mounts, controlled lighting, strong contrast
  • Back up robot and teleoperator calibrations
  • Gripper camera installed, cable strain-relieved
  • At least 10 episodes per starting-position cluster, ~20% recovery episodes
  • Operator relies only on the policy's camera views

Training

  • Hold out at least one full cluster for validation
  • Select the checkpoint by success on both training and validation sets, not by training loss

Deployment on i.MX95

  • Decompose the graph into vision, backbone, and action-expert blocks
  • Quantize tolerant blocks; keep the denoising flow at higher precision
  • Run inference asynchronously and keep inference latency under the execution horizon




📚 Resources & Inspiration


