How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks

Today’s demanding AI developer workloads often need more memory than desktop systems provide, or require access to software that laptops and PCs lack. This forces developers to move their work to the cloud or data center.

NVIDIA DGX Spark offers an alternative to cloud instances and data-center queues. The Blackwell-powered compact supercomputer delivers 1 petaflop of FP4 AI compute performance, 128 GB of coherent unified system memory, 273 GB/s of memory bandwidth, and the NVIDIA AI software stack preinstalled. With DGX Spark, you can run large, compute-intensive tasks locally, without moving to the cloud or data center.

We’ll walk you through how DGX Spark’s compute performance, large memory, and preinstalled AI software speed up fine-tuning, image generation, data science, and inference workloads, with benchmarks for each.

Fine-tuning workloads on DGX Spark

Fine-tuning pre-trained models is a common task for AI developers. To show how DGX Spark performs at this workload, we ran three tuning tasks using different methodologies: full fine-tuning, LoRA, and QLoRA.

In full fine-tuning of a Llama 3.2 3B model, we reached a peak of 82,739.2 tokens per second. Tuning a Llama 3.1 8B model using LoRA on DGX Spark reached a peak of 53,657.6 tokens per second. Tuning a Llama 3.3 70B model using QLoRA on DGX Spark reached a peak of 5,079.04 tokens per second.

Because fine-tuning is so memory intensive, none of these workloads can run on a 32 GB consumer GPU.

Fine-tuning

| Model | Method | Backend | Configuration | Peak tokens/sec |
| --- | --- | --- | --- | --- |
| Llama 3.2 3B | Full fine-tuning | PyTorch | Sequence length: 2048; Batch size: 8; Epoch: 1; Steps: 125; BF16 | 82,739.20 |
| Llama 3.1 8B | LoRA | PyTorch | Sequence length: 2048; Batch size: 4; Epoch: 1; Steps: 125; BF16 | 53,657.60 |
| Llama 3.3 70B | QLoRA | PyTorch | Sequence length: 2048; Batch size: 8; Epoch: 1; Steps: 125; FP4 | 5,079.04 |

Table 1. Fine-tuning performance
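
For reference, a LoRA run like the Llama 3.1 8B row in Table 1 can be set up in a few lines with Hugging Face transformers and peft. The sketch below is a minimal illustration, not the benchmark harness used for Table 1; the dataset and the LoRA hyperparameters (rank, target modules) are assumptions chosen for demonstration.

```python
# Minimal LoRA fine-tuning sketch (assumed setup, not NVIDIA's benchmark code).
# Mirrors the Table 1 configuration: sequence length 2048, batch size 4,
# 1 epoch, 125 steps, BF16.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "meta-llama/Llama-3.1-8B"  # gated repo; requires Hugging Face access
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Wrap the base model with low-rank adapters; only these small matrices train.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Stand-in dataset for illustration; swap in your own corpus.
dataset = load_dataset("Abirate/english_quotes", split="train")
def tokenize(batch):
    return tokenizer(batch["quote"], truncation=True, max_length=2048)
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama31-8b-lora",
                           per_device_train_batch_size=4,
                           num_train_epochs=1, max_steps=125,  # max_steps wins
                           bf16=True, logging_steps=25),
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels for causal LM loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because only the adapter matrices receive gradients, LoRA keeps optimizer state small, which is part of why the 8B model fits comfortably in DGX Spark’s 128 GB of unified memory.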

DGX Spark’s image-generation capabilities

Image generation models are always pushing for greater accuracy, higher resolutions, and faster performance. Creating high-resolution images or multiple images per prompt drives the need for more memory, as well as the compute required to generate the images.

DGX Spark’s large GPU memory and powerful compute performance allow you to work with higher-resolution images and higher-precision models to produce better image quality. Support for the FP4 data format enables DGX Spark to generate images quickly, even at high resolutions.

Using the Flux.1 12B model at FP4 precision, DGX Spark can generate a 1K image every 2.6 seconds (see Table 2). DGX Spark’s large system memory provides the capacity needed to run a BF16 SDXL 1.0 model and generate seven 1K images per minute.

Image generation

| Model | Precision | Backend | Configuration | Images/min |
| --- | --- | --- | --- | --- |
| Flux.1 12B Schnell | FP4 | TensorRT | Resolution: 1024×1024; Denoising steps: 4; Batch size: 1 | 23 |
| SDXL 1.0 | BF16 | TensorRT | Resolution: 1024×1024; Denoising steps: 50; Batch size: 2 | 7 |

Table 2. Image-generation performance
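
As a point of reference, the SDXL 1.0 configuration in Table 2 (BF16, 1024×1024, 50 denoising steps, batch size 2) maps directly onto a few lines of Hugging Face diffusers. This sketch uses the stock PyTorch pipeline; the TensorRT backend used for the benchmark is a separate, optimized path.

```python
# Minimal SDXL image-generation sketch with Hugging Face diffusers.
# Settings mirror the Table 2 row; the prompt is illustrative.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,   # BF16, as in Table 2
).to("cuda")

prompt = "a photorealistic mountain lake at sunrise"
images = pipe(
    prompt=[prompt] * 2,          # batch size 2
    height=1024, width=1024,      # 1K resolution
    num_inference_steps=50,       # 50 denoising steps
).images

for i, img in enumerate(images):
    img.save(f"sdxl_{i}.png")
```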

Using DGX Spark for data science

DGX Spark supports foundational CUDA-X libraries like NVIDIA cuML and cuDF. NVIDIA cuML accelerates machine-learning algorithms in scikit-learn, as well as UMAP and HDBSCAN, on GPUs with zero code changes required.

For computationally intensive ML algorithms like UMAP and HDBSCAN, DGX Spark can process 250 MB datasets in seconds (see Table 3). NVIDIA cuDF significantly accelerates common pandas data analysis tasks like joins and string methods. cuDF pandas operations on datasets with tens of millions of records run in just seconds on DGX Spark.

Data science

| Library | Benchmark | Dataset size | Time |
| --- | --- | --- | --- |
| NVIDIA cuML | UMAP | 250 MB | 4 secs |
| NVIDIA cuML | HDBSCAN | 250 MB | 10 secs |
| NVIDIA cuDF pandas | Key data analysis operations (joins, string methods, UDFs) | 0.5 to 5 GB | 11 secs |

Table 3. Data-science performance
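
The zero-code-change workflow looks like the sketch below: activate the cuML accelerator before importing the CPU libraries, and the unchanged umap and hdbscan calls run on the GPU (cuDF pandas works analogously via `%load_ext cudf.pandas` in a notebook or `python -m cudf.pandas script.py`). This assumes a recent cuML release with the `cuml.accel` module; the dataset shape is an assumption, sized to roughly match Table 3’s 250 MB benchmarks.

```python
# Zero-code-change GPU acceleration sketch with cuML's accelerator mode.
# install() must run before the CPU libraries are imported so their
# estimators can be routed to cuML equivalents.
import cuml.accel
cuml.accel.install()

import numpy as np
import umap      # unchanged CPU API, now GPU-backed
import hdbscan   # unchanged CPU API, now GPU-backed

# Synthetic features: 500k rows x 128 float32 cols ~= 256 MB,
# in the same ballpark as the Table 3 datasets.
X = np.random.rand(500_000, 128).astype(np.float32)

embedding = umap.UMAP(n_components=2).fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embedding)
print(f"found {labels.max() + 1} clusters")
```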

Using DGX Spark for inference

DGX Spark’s Blackwell GPU supports the FP4 data format, specifically the NVFP4 data format, which delivers near-FP8 accuracy (<1% degradation). This enables the use of smaller models without sacrificing accuracy. The smaller data footprint of FP4 also improves performance. Table 4 provides inference performance data for DGX Spark.

DGX Spark supports a variety of 4-bit data formats, including NVFP4 and MXFP4, as well as many backends such as TRT-LLM, llama.cpp, and vLLM. The system’s 1 petaflop of AI performance enables fast prompt processing, as shown in Table 4. Fast prompt processing results in a shorter time to first response token, which delivers a better experience for users and accelerates end-to-end throughput.

Inference (ISL|OSL = 2048|128, BS = 1)

| Model | Precision | Backend | Prompt processing throughput (tokens/sec) | Token generation throughput (tokens/sec) |
| --- | --- | --- | --- | --- |
| Qwen3 14B | NVFP4 | TRT-LLM | 5,928.95 | 22.71 |
| GPT-OSS-20B | MXFP4 | llama.cpp | 3,670.42 | 82.74 |
| GPT-OSS-120B | MXFP4 | llama.cpp | 1,725.47 | 55.37 |
| Llama 3.1 8B | NVFP4 | TRT-LLM | 10,256.90 | 38.65 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | TRT-LLM | 65,831.77 | 41.71 |
| Qwen3 235B (on dual DGX Spark) | NVFP4 | TRT-LLM | 23,477.03 | 11.73 |

Table 4. Inference performance

NVFP4 is a 4-bit floating point format introduced with the NVIDIA Blackwell GPU architecture. MXFP4 (Microscaling FP4) is a 4-bit floating point format created by the Open Compute Project (OCP). ISL (input sequence length) is the number of tokens in the input prompt (a.k.a. prefill tokens); OSL (output sequence length) is the number of tokens generated by the model in response (a.k.a. decode tokens).
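
To give a concrete sense of the TRT-LLM path in Table 4, here is a minimal sketch using TensorRT-LLM’s high-level LLM API. The NVFP4 checkpoint name is a placeholder assumption, and this is illustrative rather than the benchmark setup behind the table.

```python
# Minimal TensorRT-LLM inference sketch via its high-level LLM API.
from tensorrt_llm import LLM, SamplingParams

# Assumed NVFP4-quantized checkpoint name; substitute a real repo id
# or a local path to a prequantized model.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP4")

prompts = ["Explain why 4-bit formats shrink a model's memory footprint."]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))

for out in outputs:
    print(out.outputs[0].text)
```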

We also connected two DGX Sparks together via their ConnectX-7 chips to run the Qwen3 235B model. The model uses over 120 GB of memory, including overhead. Such models typically run on large cloud or data-center servers, but the fact that they can run on dual DGX Spark systems shows what’s possible for developer experimentation. As shown in the last row of Table 4, token generation throughput on dual DGX Sparks was 11.73 tokens per second.

The new NVFP4 version of the NVIDIA Nemotron Nano 2 model also performs well on DGX Spark. With the NVFP4 version, you can achieve up to 2x higher throughput with little to no accuracy degradation. Download the model checkpoints from Hugging Face or as an NVIDIA NIM.
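
Pulling checkpoints from Hugging Face is a one-liner with huggingface_hub; in the sketch below, the repo id is a placeholder, not the actual Nemotron Nano 2 NVFP4 repository name.

```python
# Hedged checkpoint-download sketch with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nvidia/Nemotron-Nano-2-NVFP4",  # hypothetical repo id; replace it
    local_dir="nemotron-nano-2-nvfp4",
)
print(f"Checkpoints downloaded to {local_dir}")
```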

Get your DGX Spark, join the DGX Spark developer community, and start your AI-building journey today.
