Autonomous AI agents are driving the next wave of AI innovation. These agents often manage long-running tasks that use multiple communication channels and background subprocesses concurrently to explore options, test solutions, and generate optimal results. This places extreme demands on local compute.
NVIDIA DGX Spark provides the performance autonomous agents require to execute these complex workflows efficiently and locally. With NVIDIA NemoClaw, part of the NVIDIA Agent Toolkit, you can install the NVIDIA OpenShell runtime—a secure environment for running autonomous agents—and open source models like NVIDIA Nemotron.
This post discusses the system capabilities and performance required to power always-on autonomous agents and explains why NVIDIA DGX Spark is a great desktop platform for autonomous AI.
Inference for autonomous AI agents
Agentic tools often must process massive context windows. OpenClaw, for instance, is an AI agent runtime that requires large context windows to understand requests and environments, and to reason through the best approach to a problem.
Prompt processing (prefill) throughput can be thought of as the reading comprehension phase of inference, and it can easily become a bottleneck on a slow GPU. It's common to see autonomous agents using contexts of 30K-120K tokens (100K tokens is roughly equivalent to reading Harry Potter and the Philosopher's Stone), with some agents processing 250K tokens for complex requests.
Table 1 shows how a typical agent or subagent performs with a large context window (128K/1K ISL/OSL).
| Model | End-to-end latency (s) | Prompt processing latency (s) | Prompt processing throughput (tok/s) | Token generation throughput (tok/s) |
| --- | --- | --- | --- | --- |
| NVIDIA Nemotron 3 Super 120B NVFP4 with TensorRT LLM | 99 | 44 | 2,855 | 18 |
| Qwen3.5 35B A3B FP8 with vLLM | 73 | 41 | 3,080 | 35.75 |
| Qwen3 Coder Next 80B FP8 with vLLM | 89 | 54 | 2,390 | 28.95 |
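As a rough sanity check on figures like those in Table 1, end-to-end latency can be decomposed into prefill time plus decode time. The sketch below is a back-of-envelope estimator, not a benchmark; the throughput numbers are taken from the table above, and real runs include scheduling and sampling overhead.

```python
# Back-of-envelope latency model: e2e ≈ ISL / prefill_tps + OSL / decode_tps.
# Throughput figures come from Table 1; measured end-to-end latency will
# differ somewhat due to scheduling and sampling overhead.

def estimate_e2e_latency(isl, osl, prefill_tps, decode_tps):
    """Return (prefill_seconds, decode_seconds, total_seconds)."""
    prefill_s = isl / prefill_tps
    decode_s = osl / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s

# 128K/1K ISL/OSL workload, Nemotron 3 Super 120B NVFP4 row of Table 1.
prefill_s, decode_s, total_s = estimate_e2e_latency(
    isl=128 * 1024, osl=1024, prefill_tps=2855, decode_tps=18
)
print(f"prefill ~{prefill_s:.0f}s, decode ~{decode_s:.0f}s, total ~{total_s:.0f}s")
```

The estimate lands near the measured 99 s end-to-end figure for that row, which shows how much of agentic latency is prefill at these context lengths.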
When moving from a single subagent to multiple subagents, simultaneous workloads must scale without significantly impacting performance. NVIDIA DGX Spark handles high concurrency effectively in this scenario.
Thanks to the NVIDIA Grace Blackwell Superchip, the GPU can parallelize multiple subagents. Two, four, or even eight subagents working through requests concurrently can take advantage of the strong concurrency capabilities of DGX Spark.
With support from frameworks that handle concurrency well (such as NVIDIA TensorRT LLM, vLLM, and SGLang), multiagent workloads run smoothly on NVIDIA DGX Spark. For tasks with 32K ISL and 1K OSL, completing four times as many tasks requires only 2.6x more time, while prompt processing throughput increases by about 3x (Table 2).
NVIDIA DGX Spark is a great platform for OpenClaw development. With NVIDIA OpenShell, you can run autonomous, self-evolving agents more safely. Start running OpenClaw locally on NVIDIA DGX Spark.
| Concurrency (# of simultaneous tasks) | End-to-end latency (s) | Median TTFT (s) | Prompt processing throughput (tok/s) | Token generation throughput (tok/s) |
| --- | --- | --- | --- | --- |
| | Lower is better | Lower is better | Higher is better | Higher is better |
| 1 | 35 | 9 | 3,261 | 38 |
| 2 | 54 | 12 | 5,363 | 47 |
| 4 | 91 | 15 | 9,616 | 53 |
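One way to read the end-to-end latencies in Table 2 is as a throughput scaling efficiency: N concurrent tasks finishing in t_N seconds versus one task in t_1 seconds. A minimal sketch using the numbers above:

```python
# Scaling efficiency from Table 2: N concurrent tasks finish in tn_s seconds,
# so effective task throughput grows by N * t1_s / tn_s versus one at a time.

def throughput_gain(concurrency, t1_s, tn_s):
    """Effective task-throughput multiplier versus running tasks serially."""
    return concurrency * t1_s / tn_s

# End-to-end latencies from Table 2 (32K ISL / 1K OSL).
latencies = {1: 35, 2: 54, 4: 91}
for n, t in latencies.items():
    print(f"concurrency {n}: {throughput_gain(n, latencies[1], t):.2f}x task throughput")
```

Four concurrent tasks take 91/35 ≈ 2.6x as long as one, i.e. roughly 1.5x higher effective task throughput than running the four tasks back to back.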
Scale inference and fine-tuning on up to four NVIDIA DGX Spark nodes
Larger models and multiple subagents require more memory to load and execute. Until now, NVIDIA DGX Spark has supported scaling up to two nodes, increasing the available memory from 128 GB on one node to 256 GB on two nodes. This capability has now been extended to up to four DGX Spark nodes.
DGX Spark also now supports several execution topologies, each tailored to different goals through the low-latency RoCE communication enabled by ConnectX-7 NICs.
- One DGX Spark node: Ideal for low-latency, large-context inference, fine-tuning up to 120B parameters, and local agentic workloads
- Two DGX Spark nodes: Balanced scaling for faster fine-tuning and larger models, as well as support for up to 400B-parameter inference
- Three DGX Spark nodes in a ring: Ideal for fine-tuning larger models or small training jobs
- Four DGX Spark nodes with RoCE 200 GbE switch: Local inference server ideal for state-of-the-art models up to 700B parameters, communication-intensive workloads, and local AI factory operations
Inference can scale up linearly on DGX Spark when internode communication is minimal. When work is largely independent per GPU, results are aggregated once at the end rather than continuously. In this case, DGX Spark nodes can run in parallel with low synchronization overhead.
For instance, a reinforcement learning (RL) workload in NVIDIA Isaac Lab can run many simulations independently on each node. Results are collected in a single step, yielding near-linear scaling across multiple DGX Spark nodes.
Inference scaling is sublinear when the workload requires frequent, fine-grained communication between nodes. During LLM inference, model execution proceeds layer by layer, with continuous synchronization required across nodes. Partial results from different DGX Spark nodes must be exchanged and merged repeatedly, which introduces significant communication overhead. As additional nodes are added, this overhead becomes increasingly dominant, limiting scaling efficiency.
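The tradeoff between these two regimes can be illustrated with a toy cost model: per-step time is the compute divided across N nodes plus a synchronization term that grows with node count. The constants below are illustrative, not measured.

```python
# Toy scaling model: step_time(N) = compute / N + sync_cost * (N - 1).
# With no synchronization, speedup is linear in N; as the per-node sync
# term grows, added nodes help less. All constants are illustrative.

def step_time(n_nodes, compute_s=100.0, sync_s=0.0):
    """Per-step wall time for compute split across n_nodes plus sync cost."""
    return compute_s / n_nodes + sync_s * (n_nodes - 1)

def speedup(n_nodes, compute_s=100.0, sync_s=0.0):
    """Speedup over a single node under the same cost model."""
    return step_time(1, compute_s, sync_s) / step_time(n_nodes, compute_s, sync_s)

print(f"independent work, 4 nodes: {speedup(4):.2f}x")              # linear
print(f"chatty workload, 4 nodes: {speedup(4, sync_s=5.0):.2f}x")   # sublinear
```

With zero sync cost, four nodes give a 4x speedup; with even a modest per-node sync term, the same four nodes yield well under 3x—the pattern described above for layer-by-layer LLM inference.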
Parallelism for AI agents: Inference at scale
Tensor parallelism enables efficient inference sharding across multiple nodes to fit the model while minimizing communication overhead. Scaling from two to four DGX Spark nodes provides excellent parallelism. Thanks to the low-latency ConnectX-7 NICs, time per output token (TPOT) scales almost linearly, improving ~2x with TP2 (two nodes) and ~4x with TP4 (four nodes) in inference use cases.
Table 3 shows how a single agent performs an inference job shared across multiple nodes.
| | 1 DGX Spark node TP1 (ms) | 2 DGX Spark nodes TP2 (ms) | 4 DGX Spark nodes TP4 (ms) |
| --- | --- | --- | --- |
| TTFT (lower is better) | 33,415 | 21,384 | 15,552 |
| TPOT (lower is better) | 269 | 133 | 72 |
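To reproduce a multi-node tensor-parallel setup like the TP4 configuration above, vLLM can span nodes via a Ray cluster. This is a hedged sketch, not a verified recipe: `HEAD_IP` and the model name are placeholders, and flags should be checked against the vLLM version in use.

```shell
# On the head DGX Spark node: start a Ray head process.
ray start --head --port=6379

# On each of the three worker nodes: join the cluster (HEAD_IP is a placeholder).
ray start --address=HEAD_IP:6379

# Back on the head node: serve a model with tensor parallelism across the
# 4 GPUs in the cluster (model name is a placeholder).
vllm serve some-org/some-large-model --tensor-parallel-size 4
```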
Several models that are popular in the context of OpenClaw—including Qwen3.5 397B, GLM 5, and MiniMax M2.5 230B—can benefit from stacking multiple DGX Spark units to increase the available memory.
Near-linear fine-tuning
Fine-tuning and similar workloads can be significantly parallelized with close-to-linear performance scaling when the model instance fits on one GPU. This reduces the communication overhead to just gradient synchronization at the end of each step.
An RL workload in NVIDIA Isaac Lab or Nanochat can benefit from this performance scaling. Isaac Lab can accommodate several copies of each environment on each DGX Spark. At every step, Isaac Lab communicates with the other nodes to synchronize training, achieving linear speedup through clustering.
| | 1 DGX Spark node TP1 | 2 DGX Spark nodes TP2 | 4 DGX Spark nodes TP4 |
| --- | --- | --- | --- |
| Collection time | 12.1 s | 11.4 s | 10.4 s |
| Learning time | 40.9 s | 41.4 s | 42.3 s |
| # environments | 1,024 | 1,024 | 1,024 |
| FPS | 630 | 1,241 | 2,520 |
| HW configuration | Total token throughput (tok/s) | Speedup versus 1 DGX Spark node |
| --- | --- | --- |
| 1 DGX Spark node | ~18,400 | 1 |
| 2 DGX Spark nodes | ~35,900 | 2 |
| 4 DGX Spark nodes | ~74,600 | 4 |
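The near-linear claim can be checked directly from the throughput figures above by dividing the observed speedup by the ideal (linear) speedup at each node count:

```python
# Near-linear scaling check using the total token throughput figures above
# (~18,400 / ~35,900 / ~74,600 tok/s on 1, 2, and 4 DGX Spark nodes).

def scaling_efficiency(nodes, tput, base_tput):
    """Fraction of ideal linear speedup achieved at this node count."""
    return (tput / base_tput) / nodes

throughputs = {1: 18_400, 2: 35_900, 4: 74_600}
for nodes, tput in throughputs.items():
    eff = scaling_efficiency(nodes, tput, throughputs[1])
    print(f"{nodes} node(s): {tput / throughputs[1]:.2f}x speedup, {eff:.0%} efficiency")
```

Both scaling steps stay within a few percent of ideal linear scaling, consistent with the once-per-step communication pattern described above.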
When using distributed data parallel (DDP), fine-tuning can similarly benefit from the low communication overhead. In this case, each node can fully host a replica of the model and communicate with the other nodes once per step.
| Nodes | Samples/step | Batch size | Samples/s | Speedup |
| --- | --- | --- | --- | --- |
| 1 DGX Spark node | 15.73 | 32 | 2.03 | – |
| 3 DGX Spark nodes | 15.69 | 96 | 6.12 | 3x |
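The DDP pattern above—each node holds a full model replica and exchanges gradients once per step—can be sketched without any framework. This is a toy illustration: the "backward pass" is faked, and a plain element-wise mean stands in for the NCCL/RoCE all-reduce collective.

```python
# Minimal data-parallel sketch: each "node" computes gradients on its own
# data shard, then a single all-reduce (here, an element-wise mean) brings
# the replicas back in sync—the only communication DDP needs per step.

def local_gradients(shard):
    """Stand-in for a backward pass: pretend the gradient is the shard mean."""
    g = sum(shard) / len(shard)
    return [g, g * 2]  # toy 2-parameter "model"

def all_reduce_mean(per_node_grads):
    """Average gradients element-wise across nodes, as an allreduce would."""
    n = len(per_node_grads)
    return [sum(g[i] for g in per_node_grads) / n
            for i in range(len(per_node_grads[0]))]

# Three nodes each see a different data shard, mirroring the 3-node DDP row.
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = all_reduce_mean([local_gradients(s) for s in shards])
print(grads)  # identical averaged gradients are then applied on every node
```

Because this exchange happens once per step regardless of batch size, tripling the nodes triples samples per second, which matches the 3x speedup in the table above.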
Develop on DGX Spark, deploy to the cloud: Cross-architecture workflows
Cloud solutions are required when moving from prototyping to large-scale production deployment. This section explains how workloads developed on DGX Spark can be deployed in the cloud.
Tile IR and cuTile Python enable seamless kernel portability from DGX Spark development environments to cloud deployment on NVIDIA Blackwell data center GPUs, with minimal code changes. Using TileGym, developers can:
- Write kernels once using cuTile Python DSL
- Test and validate on DGX Spark
- Deploy to NVIDIA Blackwell B300/B200, NVIDIA Hopper, or NVIDIA Ampere with minimal code changes
- Leverage TileGym preoptimized transformer kernels as drop-in replacements
End-to-end inference performance
Beyond kernel-level evaluation, we benchmarked complete Qwen2 7B inference using cuTile kernels on both platforms to demonstrate cross-architecture performance portability. Table 7 shows the configuration; Table 8 shows the platform specifications.
| Parameter | Value |
| --- | --- |
| Model | Qwen2 7B |
| Input length | 2,189 tokens |
| Output length | 128 tokens |
| Batch sizes | 1, 2, 4, 8, 16, 32, 64, 128 |
| Specification | NVIDIA DGX Spark (Dev) | NVIDIA Blackwell B200 (Cloud) |
| --- | --- | --- |
| Compute capability | SM 12.1 | SM 10.0 |
| SM count | 48 | 148 |
| SM frequency | 2.14 GHz | ~1.0 GHz |
| Memory type | LPDDR5X (unified) | HBM3e |
| Memory bandwidth | 273 GB/s | ~8 TB/s |
Platform-specific configuration
While the kernel source code remains identical across platforms, optimal performance is achieved through platform-specific configurations (tile size and occupancy). For the FMHA kernel example, Table 9 shows how these configurations adapt to different hardware characteristics. Tile IR compiles to architecture-specific PTX/SASS at JIT compile time, automatically leveraging platform-specific features like the Tensor Memory Accelerator (TMA) with the appropriate configuration.
| Platform | TILE_M | TILE_N | Occupancy | Rationale |
| --- | --- | --- | --- | --- |
| NVIDIA DGX Spark (SM 12.1) | 64 | 64 | 2 | Smaller tiles for 48 SMs, unified memory |
| NVIDIA B200 (SM 10.0) | 256 | 128 | 1 | Large tiles maximize HBM3e throughput |
| NVIDIA B200 (alt) | 128 | 128 | 2 | Higher occupancy, balanced parallelism |
Roofline evaluation and comparison of Tile IR kernel performance
Roofline analysis in NVIDIA Nsight Compute is a powerful visual performance framework used to determine how well an application is utilizing hardware capabilities. As a developer, roofline analysis helps you figure out whether your code is "slow" and shows why it might be hitting a performance ceiling.
Analysis of the roofline model suggests that the kernel scales effectively relative to the respective roofline, demonstrating that Tile IR is a viable option for scaling workloads. The kernel considered is the attention decode kernel, optimized using Tile IR.
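The roofline model itself is simple to compute: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. The sketch below uses the 273 GB/s DGX Spark bandwidth from Table 8; the peak-FLOPS constant is an illustrative placeholder, not a hardware spec.

```python
# Roofline model: attainable FLOP/s = min(peak_flops, AI * memory_bandwidth),
# where AI (arithmetic intensity) is FLOPs per byte moved from memory.
# BANDWIDTH is the DGX Spark figure from Table 8; PEAK_FLOPS is an
# illustrative placeholder, not a spec.

BANDWIDTH = 273e9      # bytes/s (DGX Spark LPDDR5X)
PEAK_FLOPS = 100e12    # FLOP/s, placeholder peak for illustration

def attainable_flops(arithmetic_intensity, peak=PEAK_FLOPS, bw=BANDWIDTH):
    """Roofline ceiling for a kernel with the given FLOPs-per-byte ratio."""
    return min(peak, arithmetic_intensity * bw)

ridge = PEAK_FLOPS / BANDWIDTH  # AI where the memory and compute roofs meet
print(f"ridge point: {ridge:.0f} FLOPs/byte")
for ai in (1, 100, 1000):
    print(f"AI={ai}: {attainable_flops(ai) / 1e12:.2f} TFLOP/s")
```

Kernels to the left of the ridge point are memory-bound—raising their arithmetic intensity (moving right on the x-axis) lifts the attainable ceiling, which is exactly the optimization direction discussed below.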


Performance scaling and optimization headroom
In Figure 1, the vertical position of the data points on the y-axis confirms that the kernel achieves higher hardware utilization on NVIDIA B200. Specifically, the blue dot sits closer to the NVIDIA B200 GPU memory roofline than the green dot sits to the Spark roofline.
This roofline analysis indicates additional opportunities for optimization, and suggests that algorithmic or memory optimizations made on NVIDIA DGX Spark may also benefit NVIDIA B200 GPUs.
Cache utilization and arithmetic intensity
Analysis of the x-axis reveals that the blue dot is positioned to the right of the green dot, signifying that the B200 achieves superior hardware arithmetic intensity.
- Cache efficiency: While the larger cache capacity of the NVIDIA B200 GPU provides the theoretical foundation for reducing DRAM traffic, hardware alone is insufficient. The software must be architected to exploit these resources.
- Kernel portability: The rightward shift indicates that Tile IR kernels successfully leverage the expanded NVIDIA B200 cache hierarchy upon migration.
Future Tile IR kernel optimizations aimed at increasing arithmetic intensity on Spark—moving the data point further right along the x-axis—will inherently result in compounded performance benefits when running on various cloud GPUs.
Automated cross-platform autotuning
Currently, optimal configurations are chosen based on platform characteristics. Future releases of cuTile will support fully automated cross-platform autotuning. The autotuner will automatically discover optimal tile sizes and occupancy settings for each target architecture, enabling transparent performance portability without any manual configuration.
Start with NVIDIA DGX Spark
As AI systems become more sophisticated, NVIDIA DGX Spark provides the flexible, multitopology execution environment required to deploy them efficiently. From multiagent inference to trillion-parameter serving, from fine-tuning to Tile IR cross-cloud pipelines, DGX Spark delivers both scalability and efficiency.
The result is a unified platform where enterprises can deploy and scale AI workloads—without rewriting infrastructure for every model or runtime.
