TL;DR
Robotic policies are increasingly bulky, and predict chunks of future actions rather than a single next action. This leaves the robot idle while it waits for new actions to execute, introducing noticeable lags at runtime and hurting responsiveness. Asynchronous inference tightens the control loop by decoupling action prediction from action execution, removing runtime lags and enabling more adaptive control. In this blog post, we cover the fundamentals behind async inference, and how it can be used to improve the performance of robotic policies in the real world.
Table of Contents
Getting started
Get started with async inference by following our tutorial.
Sequential inference (first) versus async inference (second). By allowing replanning and a tighter control loop, async inference results in (1) attempts at recovery, and (2) a ~2x speedup in task completion. Sequential inference keeps acting out the current action chunk even after failing to grasp the object, while async inference can replan and act on the new action chunk. Both setups use the same policy!
Async inference: a deep dive
With async inference, we decouple action execution from action prediction. This is particularly relevant given the tendency of currently popular models like [ACT], [OpenVLA], [PI0], and [SmolVLA] to output chunks of actions rather than a single action for a given observation.
Convince yourself of this by running all these models with LeRobot.
Using chunks sequentially results in (1) lags at runtime, impacting task completion time, and (2) a lack of responsiveness, due to acting largely open-loop.
Asynchronous inference mitigates both of these limitations by decoupling action prediction from action execution.
We introduced asynchronous inference in SmolVLA, and found it to result in a ~2x speed-up in task completion time with a comparable task success rate.
Specifically, we designed a two-component system where policy inference and action execution are performed in two different processes, possibly on two different machines connected over the network:
- A `PolicyServer`, hosted on accelerated hardware and capable of running inference with more computational resources than those available on a real-world robot.
- A `RobotClient`, which enqueues the received actions and executes them while the next chunk is being computed.
Communication between the `PolicyServer` and the `RobotClient` relies on gRPC, which delivers ~5× faster performance than a comparable REST API. The result is a robot that never waits for inference.
Asynchronous inference, highlighting: (1) the client sending the first observation for inference, receiving the first chunk shortly after; (2) the client sending another observation for processing before it has exhausted the current chunk; (3) the client receiving an updated action chunk, which it aggregates with the remainder of the one it was previously executing.
1. Why sequential inference falls short
Suppose a policy $\pi$ maps the current observation $o_t$ to a sequence of future actions.
Formally, $\pi(o_t) = (a_t, a_{t+1}, \ldots, a_{t+H-1})$, where $H$ is the chunk size.
A traditional control loop would therefore consist of the following steps (sketched in code after the list):
- Capture the observation $o_t$.
- Run $\pi(o_t)$ to obtain the action chunk $(a_t, \ldots, a_{t+H-1})$.
- Enqueue the chunk and start acting, popping actions from the queue.
- When the queue is empty, capture a new observation $o_{t+H}$ and repeat from step 2.
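To make the idle period concrete, here is a minimal, self-contained sketch of this sequential loop. All functions are stubs, and the numbers (100 ms latency, 30 fps, chunks of 50 actions) are illustrative rather than measurements:

```python
import time
from collections import deque

def capture_observation():
    """Stub: read camera frames and joint states from the robot."""
    return {"state": [0.0] * 6}

def policy(obs, chunk_size=50):
    """Stub: a chunking policy returning `chunk_size` future actions."""
    time.sleep(0.1)  # simulate ~100 ms of inference: the robot is idle here
    return [obs["state"]] * chunk_size

def execute(action):
    """Stub: send a single action to the motors."""

action_queue = deque()
for _ in range(300):
    if not action_queue:                  # step 4: queue exhausted
        obs = capture_observation()       # step 1: capture o_t
        action_queue.extend(policy(obs))  # step 2: blocking inference call
    execute(action_queue.popleft())       # step 3: pop and execute one action
    time.sleep(1 / 30)                    # ~30 fps control loop
```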
During step 2, the robot is idle. This latency grows with model size (and models tend to get bulkier over time), and can quickly come to dominate the interaction time (which is typically around 1/fps), as shown in the video below (coming from our Discord community 🤗):
This directly results in (1) reduced performance in terms of task completion time, since the robot has to wait for the next action chunk to be computed, and (2) reduced responsiveness, because the robot (2.1) acts largely open-loop while actions are available and (2.2) sits completely idle while waiting for the next chunk.

(Left) Sequential inference with highlighted idle periods. (Right) Time to select an action, with spikes when inference is triggered by local queue exhaustion (inference latency is ~100 ms, about 3 frames at 30 fps, using an ACT model on a 2021 MacBook Pro).
2. Asynchronous inference, in a nutshell
Our system removes the idle period by overlapping computation and execution:
- The `RobotClient` streams the latest observation to the `PolicyServer`.
- While the server runs inference, the client executes the current queue of actions.
- New actions arrive, are merged into the queue, and the loop continues.
The key idea is that the robot already knows what to do for the next few timesteps, so it can keep moving while fresh actions are being computed on the server.
Asynchronous inference overlaps the execution of the current action chunk with the computation of the next one, decoupling the two processes and possibly running them on entirely distinct machines connected over the network.
This results in a tighter control loop and a robot that never waits for inference. In turn, we observe a ~2x speedup in task completion time with a comparable task success rate, and more adaptive control thanks to the tighter loop (see video below).
3. System Architecture
| Component | Role | Technology |
|---|---|---|
| `RobotClient` | Runs on-board, streams observations, maintains an action queue, executes actions | Python, gRPC |
| `PolicyServer` | Hosts the policy, performs batched inference, sends action chunks back | Python, gRPC, possibly accelerated hardware (GPU/TPU) |
Because gRPC is built on HTTP/2 and uses protocol buffers, it provides low-latency binary messaging and bidirectional streaming out of the box, which in turn helps us maintain a tighter control loop with sub-100 ms round-trip latency (on our local network, hosting SmolVLA on an NVIDIA RTX 4090).
The `RobotClient` runs on-board and streams observations to the `PolicyServer` via gRPC. The `PolicyServer` prepares the received observations for inference, and sends an action chunk back to the `RobotClient`.
Robot Client
From the client’s perspective, observations are streamed to the server based on the status of the local queue. Incoming chunks are aggregated with the currently available action queue over their overlapping portions.
The `RobotClient` maintains a local action queue and follows a simple yet effective strategy: send a new observation when the queue length drops below a configurable threshold ($g$ in the SmolVLA paper, `chunk_size_threshold` in the code).
This threshold, expressed as a fraction of the maximum chunk size, acts as a trigger condition balancing computational load against responsiveness (see the sketch below).
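As a sketch, the trigger condition boils down to a single comparison. The chunk size of 50 is an illustrative value, while `chunk_size_threshold` mirrors the name used in the code:

```python
CHUNK_SIZE = 50             # actions per chunk produced by the policy
chunk_size_threshold = 0.7  # g in the SmolVLA paper, a fraction of CHUNK_SIZE

def should_send_observation(queue_length: int) -> bool:
    # Request a new chunk once fewer than g * CHUNK_SIZE actions remain,
    # so inference overlaps with the execution of the leftover actions.
    return queue_length < chunk_size_threshold * CHUNK_SIZE
```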
The client streams observations to the server, based on the local queue status.
From the client’s perspective, the process unfolds as follows:
- Queue monitoring: the client continuously monitors the length of its action queue against the chunk size threshold. When the queue drops below this threshold, a new observation should be sent for processing.
- Observation streaming: once the threshold condition is met, the client captures the current observation and streams it to the `PolicyServer` via gRPC. Crucially, observations are streamed rather than sent via a unary RPC because they typically exceed the maximum message size of 4MB (multiple high-resolution camera captures easily add up to this).
- Action chunk aggregation: when a new action chunk arrives from the server, the client merges it with any remaining actions in the current queue over the overlapping portion. This is where custom aggregators come into play, handling the overlapping sections between the current and incoming chunks differently. As of now, we support flexible aggregation between chunks via a custom `aggregate_fn(chunk1: torch.Tensor, chunk2: torch.Tensor) -> torch.Tensor` function, which is called for every overlapping timestep and can be user-provided.
The overlapping portions (shown in light blue in the diagram) require careful handling, and different aggregation strategies can be designed (see the sketch after this list):
- Replace: simply replace overlapping actions with the newer predictions.
- Weighted blend: combine overlapping actions using temporal weights (closer actions get higher weight).
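As a sketch, both strategies can be written as `aggregate_fn`-style callables operating on the pair of actions predicted for one overlapping timestep. The fixed blending weight below is an illustrative choice, not the exact schedule used in the codebase:

```python
import torch

def replace_aggregate(old_action: torch.Tensor, new_action: torch.Tensor) -> torch.Tensor:
    """Replace: always trust the newer prediction on the overlap."""
    return new_action

def weighted_aggregate(old_action: torch.Tensor, new_action: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Weighted blend: convex combination of the two predictions.

    Making alpha grow along the overlap gives newer predictions
    progressively more weight, i.e. a temporal weighting scheme."""
    return (1 - alpha) * old_action + alpha * new_action
```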
This system is highly configurable: the chunk size threshold can be tuned based on network latency, model inference time, and desired responsiveness.
A higher threshold means more frequent updates (and higher computational cost), while a lower threshold reduces communication overhead at the expense of potential queue starvation.
Lastly, we typically receive actions from the `PolicyServer` in one thread and execute them in another. This keeps the client listening for incoming chunks without blocking execution, always consuming the current chunk until a new one becomes fully available (a sketch of this pattern follows).
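Here is a minimal sketch of this two-thread pattern. The `chunk_stream` iterable stands in for the gRPC response stream and `execute` for the motor interface; both are assumptions of the sketch:

```python
import threading
import time
from collections import deque

action_queue: deque = deque()
queue_lock = threading.Lock()

def receiver_loop(chunk_stream, aggregate_fn):
    """Receiving thread: merge every incoming chunk into the local queue."""
    for chunk in chunk_stream:
        with queue_lock:
            # Remaining queued actions overlap with the head of the new chunk.
            overlap = min(len(action_queue), len(chunk))
            merged = [aggregate_fn(action_queue[i], chunk[i]) for i in range(overlap)]
            action_queue.clear()
            action_queue.extend(merged + list(chunk)[overlap:])

def executor_loop(execute, fps: int = 30):
    """Main thread: pop one action per environment step, never blocking on inference."""
    while True:
        with queue_lock:
            action = action_queue.popleft() if action_queue else None
        if action is not None:
            execute(action)
        time.sleep(1 / fps)

# Usage sketch:
# threading.Thread(target=receiver_loop, args=(chunk_stream, replace_fn), daemon=True).start()
# executor_loop(execute=send_to_motors)
```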
Policy Server
Upon receiving observations from the `RobotClient`, the `PolicyServer` performs the necessary observation cleaning to make them ready for inference. This process is illustrated in the image below:
The observation cleaning pipeline running on the server, highlighting the three major steps: (1) key matching, (2) preprocessing, and (3) preparation for inference.
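As an illustration of these three steps, a minimal version of such a pipeline could look as follows. The key names, normalization, and device handling are assumptions for the sketch, not LeRobot’s exact implementation:

```python
import torch

def prepare_observation(raw: dict, expected_keys: set, device: str = "cpu") -> dict:
    # (1) Key matching: keep exactly the features the policy was trained on.
    missing = expected_keys - raw.keys()
    if missing:
        raise KeyError(f"observation is missing expected keys: {missing}")
    obs = {key: raw[key] for key in expected_keys}

    # (2) Preprocessing: convert to float tensors, scale images to [0, 1].
    for key, value in obs.items():
        tensor = torch.as_tensor(value, dtype=torch.float32)
        if "image" in key:
            tensor = tensor / 255.0
        obs[key] = tensor

    # (3) Preparation for inference: add a batch dimension, move to the device.
    return {key: value.unsqueeze(0).to(device) for key, value in obs.items()}
```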
Once the observation has been prepared, it is compared with the last observation used for inference.
This avoids collapsing into a loop in which very similar observations are repeatedly processed, triggering unnecessary inference and producing similar actions (which, in turn, result in very similar observations being processed again).
We compare observations in terms of their joint-space similarity, which gives us an approximate and fast way of measuring changes in the robot’s configuration. Clearly, this metric is blind to dynamic changes in the environment (an object changing position, or disturbances being applied), but we found it to be a good trade-off for the vast majority of cases, and very effective at avoiding unnecessary inference and state collapse.
Critically, the `RobotClient` retains control over whether a given observation must be processed, to avoid deadlocks.
Observations sent by the client and tagged with `must_go=True` are processed regardless of the similarity metric (a sketch of this gate follows the figure).
The policy workflow, in which incoming observations are compared with the last one used for inference, and processed only if sufficiently different, or tagged `must_go`.
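A minimal sketch of this gate, measuring joint-space distance between consecutive observations; the `observation.state` key and the threshold value are illustrative:

```python
import torch

SIMILARITY_EPSILON = 1e-2  # illustrative threshold on joint-space distance

def should_run_inference(obs: dict, last_obs: dict | None, must_go: bool) -> bool:
    """Skip inference when the robot has barely moved since the last processed
    observation, unless the client tagged this observation as must-go."""
    if must_go or last_obs is None:
        return True
    distance = torch.linalg.norm(obs["observation.state"] - last_obs["observation.state"])
    return distance.item() > SIMILARITY_EPSILON
```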
Lastly, to ensure the `PolicyServer` always processes the latest available observation, we block incoming observations until the previous one has been successfully processed. For this, we leverage queues on the `PolicyServer`, ensuring incoming observations are not enqueued until the server is ready to process them (see below).
The client pings the server every 1/fps seconds, but observations are not enqueued for processing until the previous one has been successfully processed.
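One straightforward way to obtain this behavior is a queue of maximum size 1, as sketched below; `send_chunk_to_client` is a hypothetical placeholder for streaming the result back over gRPC:

```python
import queue

# At most one pending observation: a new one is accepted only once the
# previous one has been picked up by the inference worker.
pending: queue.Queue = queue.Queue(maxsize=1)

def on_observation(obs) -> bool:
    """Called on every client ping (~every 1/fps seconds)."""
    try:
        pending.put_nowait(obs)
        return True   # accepted for processing
    except queue.Full:
        return False  # previous observation still in flight: reject this one

def inference_worker(policy, send_chunk_to_client):
    while True:
        obs = pending.get()                # blocks until a fresh observation arrives
        send_chunk_to_client(policy(obs))  # stream the predicted chunk back
```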
4. Analyzing async inference
For all practical purposes, two time-scales matter in async inference:
- The environment step $\Delta t_{\text{env}}$, capturing how fast the robot can perform an action.
- The inference latency $\ell$: forward pass + network round-trip. We assume the network round-trip to be negligible relative to the policy inference time, though this may not hold for every setup.
Importantly, the ratio $\rho = \ell / \Delta t_{\text{env}}$
results in different behaviours:
- $\rho > 1$: the environment evolves faster than inference. In this scenario, the queue empties quickly and we degenerate to sequential control.
- $\rho \le 1$: the server keeps up, and the queue is always (nearly) full.
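To put numbers on this, take the ACT setup from the plot above: $\ell \approx 100$ ms while the robot acts at 30 fps, so $\Delta t_{\text{env}} \approx 33$ ms and $\rho \approx 3$. Roughly three actions are consumed from the queue during every inference call, so chunks must be long enough, and observations sent early enough, for the queue to never run dry.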
Critically, $\rho$ influences the number of available actions in the queue at any given time. To avoid degenerating to the sequential limit, one can:
- Use more compute for the policy server: hosting the server on a GPU reduces $\ell$ by allocating more computational resources.
- Send observations to the server more often: send a new observation whenever the queue length drops below a fraction $g$ of its maximum size.
  - $g \to 0$ reproduces sequential inference (empty queue, wait).
  - $g = 1$ sends an observation at every timestep (max compute, minimal lag).
Experiments (see plots below) show that $g \approx 0.7$ offers a good trade-off when the observations sent are not filtered out (they are all must-go). We recommend starting from $g = 0.7$ and following our documentation to tune this parameter to your needs.
The number of available actions in the queue at any given time, as a function of $g$. Larger values of $g$ result in more frequent updates and higher computational cost. Values of $g$ closer to 0 reproduce sequential inference (empty queue, wait). We found $g \approx 0.7$ to be a good trade-off in our experiments.
5. Using async in your setup
Async inference is a simple yet effective way to improve the performance of robotic policies. In our experiments with SmolVLA, async inference yields a ~2x speedup in task completion time with a comparable task success rate, and more adaptive control thanks to a tighter control loop.
To run your policy with async inference, you only need to follow our tutorial with your own custom parameters (e.g., the policy path or the chunk size threshold). Async inference comes with support for all policies that output action chunks!
Conclusions
We have introduced async inference, a simple yet effective way to improve the performance of robotic policies. In our experiments with SmolVLA, async inference results in a ~2x speedup in task completion time with a comparable task success rate, and more adaptive control from a tighter control loop.
We are excited to share this work with the community, and to see how it can be used to improve the performance of robotic policies. We welcome PRs to improve and extend the async inference framework at huggingface/lerobot, and we are available to discuss further in our Discord community 🤗.
