Autonomous AI agents are driving the next wave of AI innovation. These agents often manage long-running tasks that use multiple communication channels and background subprocesses concurrently to explore options, test solutions, and generate optimal results. This places extreme demands on local compute.
NVIDIA DGX Spark provides the performance autonomous agents require to execute these complex workflows efficiently and locally. With NVIDIA NemoClaw, part of the NVIDIA Agent Toolkit, you can install the NVIDIA OpenShell runtime—a secure environment for running autonomous agents—and open source models like NVIDIA Nemotron.
This post discusses the system capabilities and performance required to power always-on autonomous agents and explains why NVIDIA DGX Spark is a great desktop platform for autonomous AI.
Inference for autonomous AI agents
Agentic tools often must process massive context windows. OpenClaw, for instance, is an AI agent runtime that requires large context windows to understand requests and environments, and to reason through the best approach to a problem.
Prompt processing (prefill) throughput can be thought of as the reading comprehension phase of inference, and it can easily become a bottleneck on a slow GPU. It's common to see autonomous agents using contexts of 30K-120K tokens (100K tokens is roughly equivalent to reading Harry Potter and the Philosopher's Stone), with some agents processing 250K tokens for complex requests.
Table 1 shows how a typical agent or subagent performs with a large context window (128K/1K ISL/OSL).
| Model | End-to-end latency (s) | Prompt processing latency (s) | Prompt processing throughput (tok/s) | Token generation throughput (tok/s) |
| --- | --- | --- | --- | --- |
| NVIDIA Nemotron 3 Super 120B NVFP4 with TensorRT LLM | 99 | 44 | 2,855 | 18 |
| Qwen3.5 35B A3B FP8 with vLLM | 73 | 41 | 3,080 | 35.75 |
| Qwen3 Coder Next 80B FP8 with vLLM | 89 | 54 | 2,390 | 28.95 |
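As a rough sanity check on figures like those in Table 1, end-to-end latency can be decomposed into prefill time plus decode time. The sketch below is a back-of-envelope estimator, not a benchmark; the throughput numbers are taken from the table above, and real runs include scheduling and sampling overhead.

```python
# Back-of-envelope latency model: e2e ≈ ISL / prefill_tps + OSL / decode_tps.
# Throughput figures come from Table 1; measured end-to-end latency will
# differ somewhat due to scheduling and sampling overhead.

def estimate_e2e_latency(isl, osl, prefill_tps, decode_tps):
    """Return (prefill_seconds, decode_seconds, total_seconds)."""
    prefill_s = isl / prefill_tps
    decode_s = osl / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s

# 128K/1K ISL/OSL workload, Nemotron 3 Super 120B NVFP4 row of Table 1.
prefill_s, decode_s, total_s = estimate_e2e_latency(
    isl=128 * 1024, osl=1024, prefill_tps=2855, decode_tps=18
)
print(f"prefill ~{prefill_s:.0f}s, decode ~{decode_s:.0f}s, total ~{total_s:.0f}s")
```

The estimate lands near the measured 99 s end-to-end figure for that row, which shows how much of agentic latency is prefill at these context lengths.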
When moving from a single subagent to multiple subagents, simultaneous workloads must scale without significantly impacting performance. NVIDIA DGX Spark handles high concurrency effectively in this scenario.
Thanks to the NVIDIA Grace Blackwell Superchip, the GPU can parallelize multiple subagents. Two, four, or even eight subagents working through requests concurrently can take advantage of the strong concurrency capabilities of DGX Spark.
With support from frameworks that handle concurrency well (such as NVIDIA TensorRT LLM, vLLM, and SGLang), multiagent workloads run smoothly on NVIDIA DGX Spark. For tasks with 32K ISL and 1K OSL, completing four times as many tasks requires only 2.6x more time, while prompt processing throughput increases by about 3x (Table 2).
NVIDIA DGX Spark is a great platform for OpenClaw development. With NVIDIA OpenShell, you can run autonomous, self-evolving agents more safely. Start running OpenClaw locally on NVIDIA DGX Spark.
| Concurrency (# of simultaneous tasks) | End-to-end latency (s) | Median TTFT (s) | Prompt processing throughput (tok/s) | Token generation throughput (tok/s) |
| --- | --- | --- | --- | --- |
| | Lower is better | Lower is better | Higher is better | Higher is better |
| 1 | 35 | 9 | 3,261 | 38 |
| 2 | 54 | 12 | 5,363 | 47 |
| 4 | 91 | 15 | 9,616 | 53 |
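One way to read the end-to-end latencies in Table 2 is as a throughput scaling efficiency: N concurrent tasks finishing in t_N seconds versus one task in t_1 seconds. A minimal sketch using the numbers above:

```python
# Scaling efficiency from Table 2: N concurrent tasks finish in tn_s seconds,
# so effective task throughput grows by N * t1_s / tn_s versus one at a time.

def throughput_gain(concurrency, t1_s, tn_s):
    """Effective task-throughput multiplier versus running tasks serially."""
    return concurrency * t1_s / tn_s

# End-to-end latencies from Table 2 (32K ISL / 1K OSL).
latencies = {1: 35, 2: 54, 4: 91}
for n, t in latencies.items():
    print(f"concurrency {n}: {throughput_gain(n, latencies[1], t):.2f}x task throughput")
```

Four concurrent tasks take 91/35 ≈ 2.6x as long as one, i.e. roughly 1.5x higher effective task throughput than running the four tasks back to back.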
Scale inference and fine-tuning on up to four NVIDIA DGX Spark nodes
Larger models and multiple subagents require more memory to load and execute. Until now, NVIDIA DGX Spark has supported scaling up to two nodes, increasing the available memory from 128 GB on one node to 256 GB on two nodes. This capability has now been extended to up to four DGX Spark nodes.
DGX Spark also now supports several execution topologies, each tailored to different goals through the low-latency RoCE communication enabled by ConnectX-7 NICs.
- One DGX Spark node: Ideal for low-latency, large-context inference, fine-tuning up to 120B parameters, and local agentic workloads
- Two DGX Spark nodes: Balanced scaling for faster fine-tuning and larger models, as well as support for up to 400B-parameter inference
- Three DGX Spark nodes in a ring: Ideal for fine-tuning larger models or small training jobs
- Four DGX Spark nodes with RoCE 200 GbE switch: Local inference server ideal for state-of-the-art models up to 700B parameters, communication-intensive workloads, and local AI factory operations
Inference can scale up linearly on DGX Spark when internode communication is minimal. When work is largely independent per GPU, results are aggregated once at the end rather than continuously. In this case, DGX Spark nodes can run in parallel with low synchronization overhead.
For instance, a reinforcement learning (RL) workload in NVIDIA Isaac Lab can run many simulations independently on each node. Results are collected in a single step, yielding near-linear scaling across multiple DGX Spark nodes.
Inference scaling is sublinear when the workload requires frequent, fine-grained communication between nodes. During LLM inference, model execution proceeds layer by layer, with continuous synchronization required across nodes. Partial results from different DGX Spark nodes must be exchanged and merged repeatedly, which introduces significant communication overhead. As additional nodes are added, this overhead becomes increasingly dominant, limiting scaling efficiency.
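The tradeoff between these two regimes can be illustrated with a toy cost model: per-step time is the compute divided across N nodes plus a synchronization term that grows with node count. The constants below are illustrative, not measured.

```python
# Toy scaling model: step_time(N) = compute / N + sync_cost * (N - 1).
# With no synchronization, speedup is linear in N; as the per-node sync
# term grows, added nodes help less. All constants are illustrative.

def step_time(n_nodes, compute_s=100.0, sync_s=0.0):
    """Per-step wall time for compute split across n_nodes plus sync cost."""
    return compute_s / n_nodes + sync_s * (n_nodes - 1)

def speedup(n_nodes, compute_s=100.0, sync_s=0.0):
    """Speedup over a single node under the same cost model."""
    return step_time(1, compute_s, sync_s) / step_time(n_nodes, compute_s, sync_s)

print(f"independent work, 4 nodes: {speedup(4):.2f}x")              # linear
print(f"chatty workload, 4 nodes: {speedup(4, sync_s=5.0):.2f}x")   # sublinear
```

With zero sync cost, four nodes give a 4x speedup; with even a modest per-node sync term, the same four nodes yield well under 3x—the pattern described above for layer-by-layer LLM inference.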
Parallelism for AI agents: Inference at scale
Tensor parallelism enables efficient inference sharding across multiple nodes to fit the model while minimizing communication overhead. Scaling from two to four DGX Spark nodes provides excellent parallelism. Thanks to the low-latency ConnectX-7 NICs, time per output token (TPOT) scales almost linearly, improving ~2x with TP2 (two nodes) and ~4x with TP4 (four nodes) in inference use cases.
Table 3 shows how a single agent performs an inference job shared across multiple nodes.
| | 1 DGX Spark node TP1 (ms) | 2 DGX Spark nodes TP2 (ms) | 4 DGX Spark nodes TP4 (ms) |
| --- | --- | --- | --- |
| TTFT (lower is better) | 33,415 | 21,384 | 15,552 |
| TPOT (lower is better) | 269 | 133 | 72 |
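To reproduce a multi-node tensor-parallel setup like the TP4 configuration above, vLLM can span nodes via a Ray cluster. This is a hedged sketch, not a verified recipe: `HEAD_IP` and the model name are placeholders, and flags should be checked against the vLLM version in use.

```shell
# On the head DGX Spark node: start a Ray head process.
ray start --head --port=6379

# On each of the three worker nodes: join the cluster (HEAD_IP is a placeholder).
ray start --address=HEAD_IP:6379

# Back on the head node: serve a model with tensor parallelism across the
# 4 GPUs in the cluster (model name is a placeholder).
vllm serve some-org/some-large-model --tensor-parallel-size 4
```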
Several models that are popular in the context of OpenClaw—including Qwen3.5 397B, GLM 5, and MiniMax M2.5 230B—can benefit from stacking multiple DGX Spark units to increase the available memory.
Near-linear fine-tuning
Fine-tuning and similar workloads can be significantly parallelized with close-to-linear performance scaling when the model instance fits on one GPU. This reduces the communication overhead to just gradient synchronization at the end of each step.
An RL workload in NVIDIA Isaac Lab or Nanochat can benefit from this performance scaling. Isaac Lab can accommodate several copies of each environment on each DGX Spark. At every step, Isaac Lab communicates with the other nodes to synchronize training, achieving linear speedup through clustering.
| | 1 DGX Spark node TP1 | 2 DGX Spark nodes TP2 | 4 DGX Spark nodes TP4 |
| --- | --- | --- | --- |
| Collection time | 12.1 s | 11.4 s | 10.4 s |
| Learning time | 40.9 s | 41.4 s | 42.3 s |
| # environments | 1,024 | 1,024 | 1,024 |
| FPS | 630 | 1,241 | 2,520 |
| HW configuration | Total token throughput (tok/s) | Speedup versus 1 DGX Spark node |
| --- | --- | --- |
| 1 DGX Spark node | ~18,400 | 1 |
| 2 DGX Spark nodes | ~35,900 | 2 |
| 4 DGX Spark nodes | ~74,600 | 4 |
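The near-linear claim can be checked directly from the throughput figures above by dividing the observed speedup by the ideal (linear) speedup at each node count:

```python
# Near-linear scaling check using the total token throughput figures above
# (~18,400 / ~35,900 / ~74,600 tok/s on 1, 2, and 4 DGX Spark nodes).

def scaling_efficiency(nodes, tput, base_tput):
    """Fraction of ideal linear speedup achieved at this node count."""
    return (tput / base_tput) / nodes

throughputs = {1: 18_400, 2: 35_900, 4: 74_600}
for nodes, tput in throughputs.items():
    eff = scaling_efficiency(nodes, tput, throughputs[1])
    print(f"{nodes} node(s): {tput / throughputs[1]:.2f}x speedup, {eff:.0%} efficiency")
```

Both scaling steps stay within a few percent of ideal linear scaling, consistent with the once-per-step communication pattern described above.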
When using distributed data parallel (DDP), fine-tuning can similarly benefit from the low communication overhead. In this case, each node can fully host a replica of the model and communicate with the other nodes once per step.
| Nodes | Samples/step | Batch size | Samples/s | Speedup |
| --- | --- | --- | --- | --- |
| 1 DGX Spark node | 15.73 | 32 | 2.03 | – |
| 3 DGX Spark nodes | 15.69 | 96 | 6.12 | 3x |
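The DDP pattern above—each node holds a full model replica and exchanges gradients once per step—can be sketched without any framework. This is a toy illustration: the "backward pass" is faked, and a plain element-wise mean stands in for the NCCL/RoCE all-reduce collective.

```python
# Minimal data-parallel sketch: each "node" computes gradients on its own
# data shard, then a single all-reduce (here, an element-wise mean) brings
# the replicas back in sync—the only communication DDP needs per step.

def local_gradients(shard):
    """Stand-in for a backward pass: pretend the gradient is the shard mean."""
    g = sum(shard) / len(shard)
    return [g, g * 2]  # toy 2-parameter "model"

def all_reduce_mean(per_node_grads):
    """Average gradients element-wise across nodes, as an allreduce would."""
    n = len(per_node_grads)
    return [sum(g[i] for g in per_node_grads) / n
            for i in range(len(per_node_grads[0]))]

# Three nodes each see a different data shard, mirroring the 3-node DDP row.
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = all_reduce_mean([local_gradients(s) for s in shards])
print(grads)  # identical averaged gradients are then applied on every node
```

Because this exchange happens once per step regardless of batch size, tripling the nodes triples samples per second, which matches the 3x speedup in the table above.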
Develop on DGX Spark, deploy to the cloud: Cross-architecture workflows
Cloud solutions are required when moving from prototyping to large-scale production deployment. This section explains how workloads developed on DGX Spark can be deployed in the cloud.
Tile IR and cuTile Python enable seamless kernel portability from DGX Spark development environments to cloud deployment on NVIDIA Blackwell data center GPUs, with minimal code changes. Using TileGym, developers can:
- Write kernels once using cuTile Python DSL
- Test and validate on DGX Spark
- Deploy to NVIDIA Blackwell B300/B200, NVIDIA Hopper, or NVIDIA Ampere with minimal code changes
- Leverage TileGym preoptimized transformer kernels as drop-in replacements
End-to-end inference performance
Beyond kernel-level evaluation, we benchmarked complete Qwen2 7B inference using cuTile kernels on both platforms to demonstrate cross-architecture performance portability. Table 7 shows the configuration; Table 8 shows the platform specifications.
| Parameter | Value |
| --- | --- |
| Model | Qwen2 7B |
| Input length | 2,189 tokens |
| Output length | 128 tokens |
| Batch sizes | 1, 2, 4, 8, 16, 32, 64, 128 |
| Specification | NVIDIA DGX Spark (Dev) | NVIDIA Blackwell B200 (Cloud) |
| --- | --- | --- |
| Compute capability | SM 12.1 | SM 10.0 |
| SM count | 48 | 148 |
| SM frequency | 2.14 GHz | ~1.0 GHz |
| Memory type | LPDDR5X (unified) | HBM3e |
| Memory bandwidth | 273 GB/s | ~8 TB/s |
Platform-specific configuration
While the kernel source code remains identical across platforms, optimal performance is achieved through platform-specific configurations (tile size and occupancy). For the FMHA kernel example, Table 9 shows how these configurations adapt to different hardware characteristics. Tile IR compiles to architecture-specific PTX/SASS at JIT compile time, automatically leveraging platform-specific features like the Tensor Memory Accelerator (TMA) with the appropriate configuration.
| Platform | TILE_M | TILE_N | Occupancy | Rationale |
| --- | --- | --- | --- | --- |
| NVIDIA DGX Spark (SM 12.1) | 64 | 64 | 2 | Smaller tiles for 48 SMs, unified memory |
| NVIDIA B200 (SM 10.0) | 256 | 128 | 1 | Large tiles maximize HBM3e throughput |
| NVIDIA B200 (alt) | 128 | 128 | 2 | Higher occupancy, balanced parallelism |
Roofline evaluation and comparison of Tile IR kernel performance
Roofline analysis in NVIDIA Nsight Compute is a powerful visual performance framework used to determine how well an application is utilizing hardware capabilities. As a developer, roofline analysis helps you figure out whether your code is "slow" and shows why it might be hitting a performance ceiling.
Analysis of the roofline model suggests that the kernel scales effectively relative to the respective roofline, demonstrating that Tile IR is a viable option for scaling workloads. The kernel considered is the attention decode kernel, optimized using Tile IR.
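The roofline model itself is simple to compute: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. The sketch below uses the 273 GB/s DGX Spark bandwidth from Table 8; the peak-FLOPS constant is an illustrative placeholder, not a hardware spec.

```python
# Roofline model: attainable FLOP/s = min(peak_flops, AI * memory_bandwidth),
# where AI (arithmetic intensity) is FLOPs per byte moved from memory.
# BANDWIDTH is the DGX Spark figure from Table 8; PEAK_FLOPS is an
# illustrative placeholder, not a spec.

BANDWIDTH = 273e9      # bytes/s (DGX Spark LPDDR5X)
PEAK_FLOPS = 100e12    # FLOP/s, placeholder peak for illustration

def attainable_flops(arithmetic_intensity, peak=PEAK_FLOPS, bw=BANDWIDTH):
    """Roofline ceiling for a kernel with the given FLOPs-per-byte ratio."""
    return min(peak, arithmetic_intensity * bw)

ridge = PEAK_FLOPS / BANDWIDTH  # AI where the memory and compute roofs meet
print(f"ridge point: {ridge:.0f} FLOPs/byte")
for ai in (1, 100, 1000):
    print(f"AI={ai}: {attainable_flops(ai) / 1e12:.2f} TFLOP/s")
```

Kernels to the left of the ridge point are memory-bound—raising their arithmetic intensity (moving right on the x-axis) lifts the attainable ceiling, which is exactly the optimization direction discussed below.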


Performance scaling and optimization headroom
In Figure 1, the vertical position of the data points on the y-axis confirms that the kernel achieves higher hardware utilization on NVIDIA B200. Specifically, the blue dot sits closer to the NVIDIA B200 GPU memory roofline than the green dot sits to the Spark roofline.
This roofline analysis indicates additional opportunities for optimization, and suggests that algorithmic or memory optimizations made on NVIDIA DGX Spark may also benefit NVIDIA B200 GPUs.
Cache utilization and arithmetic intensity
Analysis of the x-axis reveals that the blue dot is positioned to the right of the green dot, signifying that the B200 achieves superior hardware arithmetic intensity.
- Cache efficiency: While the larger cache capacity of the NVIDIA B200 GPU provides the theoretical foundation for reducing DRAM traffic, hardware alone is insufficient. The software must be architected to exploit these resources.
- Kernel portability: The rightward shift indicates that Tile IR kernels successfully leverage the expanded NVIDIA B200 cache hierarchy upon migration.
Future Tile IR kernel optimizations aimed at increasing arithmetic intensity on Spark—moving the data point further right along the x-axis—will inherently result in compounded performance benefits when running on various cloud GPUs.
Automated cross-platform autotuning
Currently, optimal configurations are chosen based on platform characteristics. Future releases of cuTile will support fully automated cross-platform autotuning. The autotuner will automatically discover optimal tile sizes and occupancy settings for each target architecture, enabling transparent performance portability without any manual configuration.
Start with NVIDIA DGX Spark
As AI systems become more sophisticated, NVIDIA DGX Spark provides the flexible, multitopology execution environment required to deploy them efficiently. From multiagent inference to trillion-parameter serving, from fine-tuning to Tile IR cross-cloud pipelines, DGX Spark delivers both scalability and efficiency.
The result is a unified platform where enterprises can deploy and scale AI workloads—without rewriting infrastructure for every model or runtime.
