Today’s demanding AI developer workloads often need more memory than desktop systems provide or require access to software that laptops or PCs lack. This forces work to be moved to the cloud or data center.
NVIDIA DGX Spark provides an alternative to cloud instances and data-center queues. The Blackwell-powered, compact supercomputer delivers 1 petaflop of FP4 AI compute performance, 128 GB of coherent unified system memory, 273 GB/s of memory bandwidth, and the NVIDIA AI software stack preinstalled. With DGX Spark, you can run large, compute-intensive tasks locally, without moving to the cloud or data center.
We’ll walk you through how DGX Spark’s compute performance, large memory, and preinstalled AI software speed up fine-tuning, image generation, data science, and inference workloads. Keep reading for some benchmarks.
Fine-tuning workloads on DGX Spark
Tuning pre-trained models is a common task for AI developers. To show how DGX Spark performs at this workload, we ran three tuning tasks using different methodologies: full fine-tuning, LoRA, and QLoRA.
In full fine-tuning of a Llama 3.2 3B model, we reached a peak of 82,739.2 tokens per second. Tuning a Llama 3.1 8B model using LoRA on DGX Spark reached a peak of 53,657.6 tokens per second. Tuning a Llama 3.3 70B model using QLoRA on DGX Spark reached a peak of 5,079.04 tokens per second.
Because fine-tuning is so memory intensive, none of these tuning workloads can run on a 32 GB consumer GPU.
*Table 1. Fine-tuning*

| Model | Method | Backend | Configuration | Peak tokens/sec |
|---|---|---|---|---|
| Llama 3.2 3B | Full fine-tuning | PyTorch | Sequence length: 2048, batch size: 8, epochs: 1, steps: 125, BF16 | 82,739.20 |
| Llama 3.1 8B | LoRA | PyTorch | Sequence length: 2048, batch size: 4, epochs: 1, steps: 125, BF16 | 53,657.60 |
| Llama 3.3 70B | QLoRA | PyTorch | Sequence length: 2048, batch size: 8, epochs: 1, steps: 125, FP4 | 5,079.04 |
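To see why full fine-tuning in particular exceeds a 32 GB card, here is a simplified back-of-envelope memory estimate (an illustration, not a measurement: it counts only BF16 weights, BF16 gradients, and FP32 Adam moments, ignoring activations, master weights, and framework overhead, all of which add more):

```python
def full_finetune_gib(params_billions: float) -> float:
    """Rough memory floor for full fine-tuning with Adam:
    BF16 weights (2 B/param) + BF16 gradients (2 B/param)
    + FP32 Adam moments m and v (4 + 4 B/param).
    Activations and framework overhead are excluded."""
    bytes_per_param = 2 + 2 + 4 + 4  # = 12 bytes per parameter
    return params_billions * 1e9 * bytes_per_param / 2**30

# Llama 3.2 3B: even this lower bound already exceeds a 32 GB GPU.
print(round(full_finetune_gib(3), 1))  # ~33.5 GiB
```

For the 8B LoRA and 70B QLoRA jobs the arithmetic differs (only adapter weights are trained), but base weights plus activations still push well past consumer-GPU memory.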
DGX Spark’s image-generation capabilities
Image generation models are constantly pushing for greater accuracy, higher resolutions, and faster performance. Creating high-resolution images or multiple images per prompt drives the need for more memory, as well as the compute required to generate the images.
DGX Spark’s large GPU memory and powerful compute performance let you work with larger-resolution images and higher-precision models to produce better image quality. Support for the FP4 data format enables DGX Spark to generate images quickly, even at high resolutions.
Using the Flux.1 12B model at FP4 precision, DGX Spark can generate a 1K image every 2.6 seconds (see Table 2 below). DGX Spark’s large system memory provides the capacity necessary to run a BF16 SDXL 1.0 model and generate seven 1K images per minute.
*Table 2. Image generation*

| Model | Precision | Backend | Configuration | Images/min |
|---|---|---|---|---|
| Flux.1 12B Schnell | FP4 | TensorRT | Resolution: 1024×1024, denoising steps: 4, batch size: 1 | 23 |
| SDXL 1.0 | BF16 | TensorRT | Resolution: 1024×1024, denoising steps: 50, batch size: 2 | 7 |
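The Flux.1 throughput figure follows directly from the per-image latency, as a quick sanity check of Table 2 shows:

```python
# Flux.1 12B at FP4: one 1K image every ~2.6 seconds
seconds_per_image = 2.6
images_per_minute = 60 / seconds_per_image
print(int(images_per_minute))  # ~23 images/min, matching Table 2
```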
Using DGX Spark for data science
DGX Spark supports foundational CUDA-X libraries like NVIDIA cuML and cuDF. NVIDIA cuML accelerates machine-learning algorithms in scikit-learn, as well as UMAP and HDBSCAN, on GPUs with zero code changes required.
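As a sketch of the zero-code-change pattern: the script below is ordinary scikit-learn and runs on a CPU as written; on DGX Spark, the same file can be launched under cuML’s accelerator mode (for example, `python -m cuml.accel script.py`) to run on the GPU without edits. KMeans is used here for portability; UMAP and HDBSCAN follow the same pattern with the umap-learn and hdbscan packages. The data is synthetic and illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Ordinary scikit-learn code -- no GPU-specific imports or calls.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(500, 8)),   # cluster A
    rng.normal(loc=5.0, scale=0.5, size=(500, 8)),   # cluster B
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(len(set(km.labels_)))  # 2 clusters recovered
```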
For computationally intensive ML algorithms like UMAP and HDBSCAN, DGX Spark can process 250 MB datasets in seconds (see Table 3 below). NVIDIA cuDF significantly accelerates common pandas data analysis tasks like joins and string methods. cuDF pandas operations on datasets with tens of millions of records run in just seconds on DGX Spark.
*Table 3. Data science*

| Library | Benchmark | Dataset size | Time |
|---|---|---|---|
| NVIDIA cuML | UMAP | 250 MB | 4 secs |
| NVIDIA cuML | HDBSCAN | 250 MB | 10 secs |
| NVIDIA cuDF pandas | Key data analysis operations (joins, string methods, UDFs) | 0.5 to 5 GB | 11 secs |
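cuDF’s pandas acceleration follows the same zero-code-change idea: the join and string operations below are plain pandas and run unmodified on a CPU; on DGX Spark, launching the same script with `python -m cudf.pandas script.py` executes them on the GPU. The tiny DataFrames and column names here are purely illustrative stand-ins for the multi-gigabyte datasets in Table 3.

```python
import pandas as pd

# Plain pandas -- accelerated transparently when run under cudf.pandas.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [10.0, 25.0, 5.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["ada lovelace", "alan turing", "grace hopper"],
})

# A join followed by a string method -- the kinds of operations benchmarked in Table 3.
merged = orders.merge(customers, on="customer_id", how="left")
merged["name"] = merged["name"].str.title()
print(merged.groupby("name")["amount"].sum().to_dict())
```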
Using DGX Spark for inference
DGX Spark’s Blackwell GPU supports the FP4 data format, specifically the NVFP4 data format, which provides near-FP8 accuracy (<1% degradation). This enables the use of smaller models without sacrificing accuracy. The smaller data footprint of FP4 also improves performance. Table 4 below provides inference performance data for DGX Spark.
DGX Spark supports a variety of 4-bit data formats (NVFP4, MXFP4) and many backends, such as TRT-LLM, llama.cpp, and vLLM. The system’s 1 petaflop of AI performance enables it to deliver fast prompt processing, as shown in Table 4. Fast prompt processing results in a faster time to first response token, which delivers a better experience for users and accelerates end-to-end throughput.
*Table 4. Inference (ISL|OSL = 2048|128, BS = 1)*

| Model | Precision | Backend | Prompt processing throughput (tokens/sec) | Token generation throughput (tokens/sec) |
|---|---|---|---|---|
| Qwen3 14B | NVFP4 | TRT-LLM | 5,928.95 | 22.71 |
| GPT-OSS-20B | MXFP4 | llama.cpp | 3,670.42 | 82.74 |
| GPT-OSS-120B | MXFP4 | llama.cpp | 1,725.47 | 55.37 |
| Llama 3.1 8B | NVFP4 | TRT-LLM | 10,256.90 | 38.65 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | TRT-LLM | 65,831.77 | 41.71 |
| Qwen3 235B (on dual DGX Spark) | NVFP4 | TRT-LLM | 23,477.03 | 11.73 |
- **NVFP4**: A 4-bit floating point format introduced with the NVIDIA Blackwell GPU architecture.
- **MXFP4**: Microscaling FP4, a 4-bit floating point format created by the Open Compute Project (OCP).
- **ISL (input sequence length)**: The number of tokens in the input prompt (a.k.a. prefill tokens).
- **OSL (output sequence length)**: The number of tokens generated by the model in response (a.k.a. decode tokens).
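Prompt-processing throughput maps directly to time to first token: with the 2,048-token prompts used in these benchmarks, dividing ISL by the prompt-processing rates in Table 4 gives the approximate prefill latency before the first response token can appear.

```python
ISL = 2048  # prompt length used in the Table 4 benchmarks

# Prompt-processing throughput (tokens/sec), from Table 4
prompt_tps = {
    "Qwen3 14B": 5928.95,
    "Llama 3.1 8B": 10256.9,
}

for model, tps in prompt_tps.items():
    ttft = ISL / tps  # seconds of prefill before the first output token
    print(f"{model}: ~{ttft:.2f} s")
```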
We also connected two DGX Sparks together via their ConnectX-7 chips to run the Qwen3 235B model. The model uses over 120 GB of memory, including overhead. Such models typically run on large cloud or data-center servers, but the fact that they can run on dual DGX Spark systems shows what’s possible for developer experimentation. As shown in the last row of Table 4, the token generation throughput on dual DGX Sparks was 11.73 tokens per second.
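The memory requirement is easy to sanity-check: at 4 bits (0.5 bytes) per parameter, the NVFP4 weights alone for a 235B-parameter model come to roughly 117.5 GB, and KV cache plus runtime overhead push the total past a single Spark’s 128 GB.

```python
params = 235e9          # Qwen3 235B parameter count
bytes_per_param = 0.5   # NVFP4: 4 bits per weight

weights_gb = params * bytes_per_param / 1e9
print(round(weights_gb, 1))  # 117.5 GB of weights alone, before KV cache and overhead
```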
The new NVFP4 version of the NVIDIA Nemotron Nano 2 model also performs well on DGX Spark. With the NVFP4 version, you can now achieve up to 2x higher throughput with little to no accuracy degradation. Download the model checkpoints from Hugging Face or as an NVIDIA NIM.
Get your DGX Spark, join the DGX Spark developer community, and start your AI-building journey today.
