TL;DR
Robotic policies are increasingly bulky, and predict chunks of future actions rather than a single next action. This leaves the robot idle while it waits for new actions to execute, introducing noticeable lags at runtime and hurting responsiveness. Asynchronous inference tightens the control loop by decoupling action prediction from action execution, removing runtime lags and enabling more adaptive control. In this blog post, we cover the fundamentals behind async inference, and how it can be used to improve the performance of robotic policies in the real world.
Table of Contents
Getting started
Get started with async inference by following our tutorial.
Sequential inference (first) versus async inference (second). By allowing replanning and a tighter control loop, async inference results in (1) attempts at recovery, and (2) a ~2x speedup in task completion. Sequential inference keeps acting out the current action chunk even after failing to grasp the object, while async inference can replan and act on the new action chunk. Both setups use the same policy!
Async inference: a deep dive
With async inference, we decouple action execution from action prediction. This is particularly relevant given the tendency of currently popular models like [ACT], [OpenVLA], [PI0], and [SmolVLA] to output chunks of actions rather than a single action for a given observation.
Convince yourself of this by running all these models with LeRobot.
Using chunks sequentially results in (1) lags at runtime, impacting task completion time, and (2) a lack of responsiveness, due to acting largely open-loop.
Asynchronous inference mitigates both of these limitations by decoupling action prediction from action execution.
We introduced asynchronous inference in SmolVLA, and found it to result in a ~2x speed-up in task completion time with a comparable task success rate.
Specifically, we designed a two-component system where policy inference and action execution are performed in two different processes, possibly on two different machines connected over the network:
- A `PolicyServer`, hosted on accelerated hardware and capable of running inference with more computational resources than those available on a real-world robot.
- A `RobotClient`, which enqueues the received actions and executes them while the next chunk is being computed.
Communication between the `PolicyServer` and the `RobotClient` relies on gRPC, which delivers ~5× faster performance than a comparable REST API. The result is a robot that never waits for inference.
Asynchronous inference, highlighting: (1) the client sending the first observation for inference, receiving the first chunk shortly after; (2) the client sending another observation for processing before it has exhausted the current chunk; (3) the client receiving an updated action chunk, which it aggregates with the remainder of the one it was previously executing.
1. Why sequential inference falls short
Suppose a policy $\pi$ maps the current observation $o_t$ to a sequence of future actions.
Formally, $\pi(o_t) = (a_t, a_{t+1}, \ldots, a_{t+H-1})$, where $H$ is the chunk size.
A traditional control loop would therefore consist of the following steps (sketched in code after the list):
- Capture the observation $o_t$.
- Run $\pi(o_t)$ to obtain the action chunk $(a_t, \ldots, a_{t+H-1})$.
- Enqueue the chunk and start acting, popping actions from the queue.
- When the queue is empty, capture a new observation $o_{t+H}$ and repeat from step 2.
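To make the idle period concrete, here is a minimal, self-contained sketch of this sequential loop. All functions are stubs, and the numbers (100 ms latency, 30 fps, chunks of 50 actions) are illustrative rather than measurements:

```python
import time
from collections import deque

def capture_observation():
    """Stub: read camera frames and joint states from the robot."""
    return {"state": [0.0] * 6}

def policy(obs, chunk_size=50):
    """Stub: a chunking policy returning `chunk_size` future actions."""
    time.sleep(0.1)  # simulate ~100 ms of inference: the robot is idle here
    return [obs["state"]] * chunk_size

def execute(action):
    """Stub: send a single action to the motors."""

action_queue = deque()
for _ in range(300):
    if not action_queue:                  # step 4: queue exhausted
        obs = capture_observation()       # step 1: capture o_t
        action_queue.extend(policy(obs))  # step 2: blocking inference call
    execute(action_queue.popleft())       # step 3: pop and execute one action
    time.sleep(1 / 30)                    # ~30 fps control loop
```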
During step 2, the robot is idle. This latency grows with model size (and models tend to get bulkier over time), and can quickly come to dominate the interaction time (which is typically around 1/fps), as shown in the video below (coming from our Discord community 🤗):
This directly results in (1) reduced performance in terms of task completion time, since the robot has to wait for the next action chunk to be computed, and (2) reduced responsiveness, because the robot (2.1) acts largely open-loop while actions are available and (2.2) sits completely idle while waiting for the next chunk.

(Left) Sequential inference with highlighted idle periods. (Right) Time to select an action, with spikes when inference is triggered by local queue exhaustion (inference latency is ~100 ms, about 3 frames at 30 fps, using an ACT model on a 2021 MacBook Pro).
2. Asynchronous inference, in a nutshell
Our system removes the idle period by overlapping computation and execution:
- The `RobotClient` streams the latest observation to the `PolicyServer`.
- While the server runs inference, the client executes the current queue of actions.
- New actions arrive, are merged into the queue, and the loop continues.
The key idea is that the robot already knows what to do for the next few timesteps, so it can keep moving while fresh actions are being computed on the server.
Asynchronous inference overlaps the execution of the current action chunk with the computation of the next one, decoupling the two processes and possibly running them on entirely distinct machines connected over the network.
This results in a tighter control loop and a robot that never waits for inference. In turn, we observe a ~2x speedup in task completion time with a comparable task success rate, and more adaptive control thanks to the tighter loop (see video below).
3. System Architecture
| Component | Role | Technology |
|---|---|---|
| `RobotClient` | Runs on-board, streams observations, maintains an action queue, executes actions | Python, gRPC |
| `PolicyServer` | Hosts the policy, performs batched inference, sends action chunks back | Python, gRPC, possibly accelerated hardware (GPU/TPU) |
Because gRPC is built on HTTP/2 and uses protocol buffers, it provides low-latency binary messaging and bidirectional streaming out of the box, which in turn helps us maintain a tighter control loop with sub-100 ms round-trip latency (on our local network, hosting SmolVLA on an NVIDIA RTX 4090).
The `RobotClient` runs on-board and streams observations to the `PolicyServer` via gRPC. The `PolicyServer` prepares the received observations for inference, and sends an action chunk back to the `RobotClient`.
Robot Client
From the client’s perspective, observations are streamed to the server based on the status of the local queue. Incoming chunks are aggregated with the currently available action queue over their overlapping portions.
The `RobotClient` maintains a local action queue and follows a simple yet effective strategy: send a new observation when the queue length drops below a configurable threshold ($g$ in the SmolVLA paper, `chunk_size_threshold` in the code).
This threshold, expressed as a fraction of the maximum chunk size, acts as a trigger condition balancing computational load against responsiveness (see the sketch below).
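As a sketch, the trigger condition boils down to a single comparison. The chunk size of 50 is an illustrative value, while `chunk_size_threshold` mirrors the name used in the code:

```python
CHUNK_SIZE = 50             # actions per chunk produced by the policy
chunk_size_threshold = 0.7  # g in the SmolVLA paper, a fraction of CHUNK_SIZE

def should_send_observation(queue_length: int) -> bool:
    # Request a new chunk once fewer than g * CHUNK_SIZE actions remain,
    # so inference overlaps with the execution of the leftover actions.
    return queue_length < chunk_size_threshold * CHUNK_SIZE
```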
The client streams observations to the server, based on the local queue status.
From the client’s perspective, the process unfolds as follows:
- Queue monitoring: the client continuously monitors the length of its action queue against the chunk size threshold. When the queue drops below this threshold, a new observation should be sent for processing.
- Observation streaming: once the threshold condition is met, the client captures the current observation and streams it to the `PolicyServer` via gRPC. Crucially, observations are streamed rather than sent via a unary RPC because they typically exceed the maximum message size of 4MB (multiple high-resolution camera captures easily add up to this).
- Action chunk aggregation: when a new action chunk arrives from the server, the client merges it with any remaining actions in the current queue over the overlapping portion. This is where custom aggregators come into play, handling the overlapping sections between the current and incoming chunks differently. As of now, we support flexible aggregation between chunks via a custom `aggregate_fn(chunk1: torch.Tensor, chunk2: torch.Tensor) -> torch.Tensor` function, which is called for every overlapping timestep and can be user-provided.
The overlapping portions (shown in light blue in the diagram) require careful handling, and different aggregation strategies can be designed (see the sketch after this list):
- Replace: simply replace overlapping actions with the newer predictions.
- Weighted blend: combine overlapping actions using temporal weights (closer actions get higher weight).
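As a sketch, both strategies can be written as `aggregate_fn`-style callables operating on the pair of actions predicted for one overlapping timestep. The fixed blending weight below is an illustrative choice, not the exact schedule used in the codebase:

```python
import torch

def replace_aggregate(old_action: torch.Tensor, new_action: torch.Tensor) -> torch.Tensor:
    """Replace: always trust the newer prediction on the overlap."""
    return new_action

def weighted_aggregate(old_action: torch.Tensor, new_action: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Weighted blend: convex combination of the two predictions.

    Making alpha grow along the overlap gives newer predictions
    progressively more weight, i.e. a temporal weighting scheme."""
    return (1 - alpha) * old_action + alpha * new_action
```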
This system is highly configurable: the chunk size threshold can be tuned based on network latency, model inference time, and desired responsiveness.
A higher threshold means more frequent updates (and higher computational cost), while a lower threshold reduces communication overhead at the expense of potential queue starvation.
Lastly, we typically receive actions from the `PolicyServer` in one thread and execute them in another. This keeps the client listening for incoming chunks without blocking execution, always consuming the current chunk until a new one becomes fully available (a sketch of this pattern follows).
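Here is a minimal sketch of this two-thread pattern. The `chunk_stream` iterable stands in for the gRPC response stream and `execute` for the motor interface; both are assumptions of the sketch:

```python
import threading
import time
from collections import deque

action_queue: deque = deque()
queue_lock = threading.Lock()

def receiver_loop(chunk_stream, aggregate_fn):
    """Receiving thread: merge every incoming chunk into the local queue."""
    for chunk in chunk_stream:
        with queue_lock:
            # Remaining queued actions overlap with the head of the new chunk.
            overlap = min(len(action_queue), len(chunk))
            merged = [aggregate_fn(action_queue[i], chunk[i]) for i in range(overlap)]
            action_queue.clear()
            action_queue.extend(merged + list(chunk)[overlap:])

def executor_loop(execute, fps: int = 30):
    """Main thread: pop one action per environment step, never blocking on inference."""
    while True:
        with queue_lock:
            action = action_queue.popleft() if action_queue else None
        if action is not None:
            execute(action)
        time.sleep(1 / fps)

# Usage sketch:
# threading.Thread(target=receiver_loop, args=(chunk_stream, replace_fn), daemon=True).start()
# executor_loop(execute=send_to_motors)
```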
Policy Server
Upon receiving observations from the `RobotClient`, the `PolicyServer` performs the necessary observation cleaning to make them ready for inference. This process is illustrated in the image below:
The observation cleaning pipeline running on the server, highlighting the three major steps: (1) key matching, (2) preprocessing, and (3) preparation for inference.
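As an illustration of these three steps, a minimal version of such a pipeline could look as follows. The key names, normalization, and device handling are assumptions for the sketch, not LeRobot’s exact implementation:

```python
import torch

def prepare_observation(raw: dict, expected_keys: set, device: str = "cpu") -> dict:
    # (1) Key matching: keep exactly the features the policy was trained on.
    missing = expected_keys - raw.keys()
    if missing:
        raise KeyError(f"observation is missing expected keys: {missing}")
    obs = {key: raw[key] for key in expected_keys}

    # (2) Preprocessing: convert to float tensors, scale images to [0, 1].
    for key, value in obs.items():
        tensor = torch.as_tensor(value, dtype=torch.float32)
        if "image" in key:
            tensor = tensor / 255.0
        obs[key] = tensor

    # (3) Preparation for inference: add a batch dimension, move to the device.
    return {key: value.unsqueeze(0).to(device) for key, value in obs.items()}
```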
Once the observation has been prepared, it is compared with the last observation used for inference.
This avoids collapsing into a loop in which very similar observations are repeatedly processed, triggering unnecessary inference and producing similar actions (which, in turn, result in very similar observations being processed again).
We compare observations in terms of their joint-space similarity, which gives us an approximate and fast way of measuring changes in the robot’s configuration. Clearly, this metric is blind to dynamic changes in the environment (an object changing position, or disturbances being applied), but we found it to be a good trade-off for the vast majority of cases, and very effective at avoiding unnecessary inference and state collapse.
Critically, the `RobotClient` retains control over whether a given observation must be processed, to avoid deadlocks.
Observations sent by the client and tagged with `must_go=True` are processed regardless of the similarity metric (a sketch of this gate follows the figure).
The policy workflow, in which incoming observations are compared with the last one used for inference, and processed only if sufficiently different, or tagged `must_go`.
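A minimal sketch of this gate, measuring joint-space distance between consecutive observations; the `observation.state` key and the threshold value are illustrative:

```python
import torch

SIMILARITY_EPSILON = 1e-2  # illustrative threshold on joint-space distance

def should_run_inference(obs: dict, last_obs: dict | None, must_go: bool) -> bool:
    """Skip inference when the robot has barely moved since the last processed
    observation, unless the client tagged this observation as must-go."""
    if must_go or last_obs is None:
        return True
    distance = torch.linalg.norm(obs["observation.state"] - last_obs["observation.state"])
    return distance.item() > SIMILARITY_EPSILON
```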
Lastly, to ensure the `PolicyServer` always processes the latest available observation, we block incoming observations until the previous one has been successfully processed. For this, we leverage queues on the `PolicyServer`, ensuring incoming observations are not enqueued until the server is ready to process them (see below).
The client pings the server every 1/fps seconds, but observations are not enqueued for processing until the previous one has been successfully processed.
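One straightforward way to obtain this behavior is a queue of maximum size 1, as sketched below; `send_chunk_to_client` is a hypothetical placeholder for streaming the result back over gRPC:

```python
import queue

# At most one pending observation: a new one is accepted only once the
# previous one has been picked up by the inference worker.
pending: queue.Queue = queue.Queue(maxsize=1)

def on_observation(obs) -> bool:
    """Called on every client ping (~every 1/fps seconds)."""
    try:
        pending.put_nowait(obs)
        return True   # accepted for processing
    except queue.Full:
        return False  # previous observation still in flight: reject this one

def inference_worker(policy, send_chunk_to_client):
    while True:
        obs = pending.get()                # blocks until a fresh observation arrives
        send_chunk_to_client(policy(obs))  # stream the predicted chunk back
```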
4. Analyzing async inference
For all practical purposes, two time-scales matter in async inference:
- The environment step $\Delta t_{\text{env}}$, capturing how fast the robot can perform an action.
- The inference latency $\ell$: forward pass + network round-trip. We assume the network round-trip to be negligible relative to the policy inference time, though this may not hold for every setup.
Importantly, the ratio $\rho = \ell / \Delta t_{\text{env}}$
results in different behaviours:
- $\rho > 1$: the environment evolves faster than inference. In this scenario, the queue empties quickly and we degenerate to sequential control.
- $\rho \le 1$: the server keeps up, and the queue is always (nearly) full.
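To put numbers on this, take the ACT setup from the plot above: $\ell \approx 100$ ms while the robot acts at 30 fps, so $\Delta t_{\text{env}} \approx 33$ ms and $\rho \approx 3$. Roughly three actions are consumed from the queue during every inference call, so chunks must be long enough, and observations sent early enough, for the queue to never run dry.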
Critically, $\rho$ influences the number of available actions in the queue at any given time. To avoid degenerating to the sequential limit, one can:
- Use more compute for the policy server: hosting the server on a GPU reduces $\ell$ by allocating more computational resources.
- Send observations to the server more often: send a new observation whenever the queue length drops below a fraction $g$ of its maximum size.
  - $g \to 0$ reproduces sequential inference (empty queue, wait).
  - $g = 1$ sends an observation at every timestep (max compute, minimal lag).
Experiments (see plots below) show that $g \approx 0.7$ offers a good trade-off when the observations sent are not filtered out (they are all must-go). We recommend starting from $g = 0.7$ and following our documentation to tune this parameter to your needs.
The number of available actions in the queue at any given time, as a function of $g$. Larger values of $g$ result in more frequent updates and higher computational cost. Values of $g$ closer to 0 reproduce sequential inference (empty queue, wait). We found $g \approx 0.7$ to be a good trade-off in our experiments.
5. Using async in your setup
Async inference is a simple yet effective way to improve the performance of robotic policies. In our experiments with SmolVLA, async inference yields a ~2x speedup in task completion time with a comparable task success rate, and more adaptive control thanks to a tighter control loop.
To run your policy with async inference, you only need to follow our tutorial with your own custom parameters (e.g., the policy path or the chunk size threshold). Async inference comes with support for all policies that output action chunks!
Conclusions
We have introduced async inference, a simple yet effective way to improve the performance of robotic policies. In our experiments with SmolVLA, async inference results in a ~2x speedup in task completion time with a comparable task success rate, and more adaptive control from a tighter control loop.
We are excited to share this work with the community, and to see how it can be used to improve the performance of robotic policies. We welcome PRs to improve and extend the async inference framework at huggingface/lerobot, and we are available to discuss further in our Discord community 🤗.
