Solving large-scale problems in Electronic Design Automation (EDA), Computational Fluid Dynamics (CFD), and advanced optimization workflows has become the norm as chip designs, manufacturing, and multi-physics simulations have grown in complexity. These workloads push traditional solvers to their limits and require unprecedented scalability and performance. The NVIDIA CUDA Direct Sparse Solver (cuDSS) is built to let users run sparse solvers at massive scale with minimal code changes, unlocking breakthrough speed and efficiency for next-generation engineering and design.
You can leverage your CPU and GPU together using hybrid memory mode to run larger problems that otherwise wouldn't fit in a single GPU's memory, run a workload across multiple GPUs, or even scale to multiple nodes effortlessly. This blog discusses cuDSS strategies for solving large-scale problems.
Getting started
To start, this blog assumes you already have working code that uses cuDSS. You may have also explored the introductory examples on GitHub (here and here) that demonstrate running cuDSS on a single GPU and adjusting default solution parameters using the Get and Set functions. These examples cover creating matrices and the main cuDSS objects, and executing the three core phases of cuDSS: analysis, numerical factorization, and solution.
Thanks to the increased memory capacity of recent GPU generations, even a single GPU can handle fairly large sparse problems. However, when tackling truly massive problems (on the order of more than 10 million rows and over a billion nonzeros), there are effective strategies to make cuDSS run fast and efficiently. The first approach still uses a single GPU but introduces techniques to handle these larger problems without major code changes.
Rethink your data types: Why INT64 matters now
When you create a dense or sparse matrix for cuDSS, you're likely to use one of two functions, cudssMatrixCreateDn() or cudssMatrixCreateCsr(), or even both. The functions, as described in the documentation, are shown below.
cudssMatrixCreateDn
cudssStatus_t cudssMatrixCreateDn(
    cudssMatrix_t *matrix,
    int64_t nrows,
    int64_t ncols,
    int64_t ld,
    void *values,
    cudaDataType_t valueType,
    cudssLayout_t layout
)
The second function, cudssMatrixCreateCsr(), is shown next.
cudssMatrixCreateCsr
cudssStatus_t cudssMatrixCreateCsr(
    cudssMatrix_t *matrix,
    int64_t nrows,
    int64_t ncols,
    int64_t nnz,
    void *rowStart,
    void *rowEnd,
    void *colIndices,
    void *values,
    cudaDataType_t indexType,
    cudaDataType_t valueType,
    cudssMatrixType_t mtype,
    cudssMatrixViewType_t mview,
    cudssIndexBase_t indexBase
)
In cuDSS versions before 0.7.0, indices of the sparse matrices could only use 32-bit integers. Specifically, the underlying data type for rowStart, rowEnd, and colIndices could only be int, and the indexType parameter could only be CUDA_R_32I. From cuDSS 0.7.0 onward, you can solve larger problems by using 64-bit integer indexing arrays of type int64_t and CUDA_R_64I for the indexType.
Note: There is still a limitation of fewer than 2^31 rows and columns in the input matrix (but with 64-bit indices, the input matrix can have many more nonzeros).
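As a rough sketch of what this looks like (not taken from the introductory examples; the array names, problem sizes, and matrix properties below are illustrative assumptions), a system with more than 2^31 nonzeros could be created with 64-bit index arrays as follows. Error handling reuses the same CUDSS_CALL_AND_CHECK/status pattern as the examples in this post.
/* Illustrative sketch: a large SPD system with 64-bit CSR index arrays.
 * rowStart_d, colIndices_d, values_d, b_d, and x_d are assumed device buffers. */
int64_t n   = 20000000;        /* rows/cols must still be < 2^31             */
int64_t nnz = 2500000000LL;    /* > 2^31 nonzeros now allowed (cuDSS 0.7.0+) */

int64_t *rowStart_d;           /* device array, length n + 1, type int64_t   */
int64_t *colIndices_d;         /* device array, length nnz, type int64_t     */
double  *values_d;             /* device array, length nnz                   */
double  *b_d, *x_d;            /* device arrays, length n                    */
/* ... allocate with cudaMalloc and fill with the matrix and right-hand side ... */

cudssMatrix_t A, b, x;
CUDSS_CALL_AND_CHECK(cudssMatrixCreateCsr(&A, n, n, nnz,
                     rowStart_d, NULL /* rowEnd is NULL for 3-array CSR */,
                     colIndices_d, values_d,
                     CUDA_R_64I,       /* 64-bit indices                  */
                     CUDA_R_64F,       /* double-precision values         */
                     CUDSS_MTYPE_SPD, CUDSS_MVIEW_UPPER, CUDSS_BASE_ZERO),
                     status, "cudssMatrixCreateCsr with 64-bit indices");
CUDSS_CALL_AND_CHECK(cudssMatrixCreateDn(&b, n, 1, n, b_d, CUDA_R_64F,
                     CUDSS_LAYOUT_COL_MAJOR),
                     status, "cudssMatrixCreateDn for b");
CUDSS_CALL_AND_CHECK(cudssMatrixCreateDn(&x, n, 1, n, x_d, CUDA_R_64F,
                     CUDSS_LAYOUT_COL_MAJOR),
                     status, "cudssMatrixCreateDn for x");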
Hybrid memory mode: blurring the line between CPU and GPU
cuDSS hybrid memory mode is designed to overcome the memory limitations of a single GPU when solving extremely large sparse linear problems by using both GPU and CPU memory.
However, there is a tradeoff: data transfer between CPU and GPU takes time and is governed by bus bandwidth. While you get to tackle larger problems, you should expect some performance hit due to these transfers. That said, thanks to modern NVIDIA driver optimizations and fast CPU/GPU interconnects (such as those in NVIDIA Grace Blackwell nodes), the penalty is manageable, and for certain problem sizes hybrid memory performance scales impressively.
Hybrid memory mode is not on by default, so the first step is to enable it by calling cudssConfigSet() to set CUDSS_CONFIG_HYBRID_MODE, which tells cuDSS to use hybrid memory mode. Note that this change must be made prior to calling cudssExecute().
Device memory is managed automatically by cuDSS by default; it determines how much device memory is required, up to the full capacity of the GPU. Alternatively, users can specify a smaller memory footprint by setting a user-defined limit anywhere from the value of CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN up to the device memory available after the analysis (symbolic factorization) phase, which can be queried via the CUDA Runtime API cudaMemGetInfo. A few highlights to note:
- Even when hybrid memory mode is on, cuDSS first attempts to use device memory (and avoids using CPU memory if possible) to achieve the best performance.
- Best performance is achieved by using the maximum amount of GPU memory (which results in fewer memory transfers between the CPU and the GPU).
- The hybrid memory limit can be set per device (as shown in the following code block).
The example code walks you through fetching the minimum device memory requirements and setting your memory limits accordingly, giving you fine-grained control over the memory footprint.
...
/* Enable hybrid memory mode where factors are stored in host memory.
   Note: It must be set before the first call to the ANALYSIS step. */
int hybrid_mode = 1;
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_MODE,
                     &hybrid_mode, sizeof(hybrid_mode)), status,
                     "cudssConfigSet CUDSS_CONFIG_HYBRID_MODE");
/* Symbolic factorization */
...
/* (optional) The user can query the minimal amount of device memory sufficient
   for the hybrid memory mode.
   Note: By default, cuDSS will try to use all available
   device memory if needed */
size_t sizeWritten;
int64_t device_memory_min;
CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                     CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                     &device_memory_min, sizeof(device_memory_min),
                     &sizeWritten), status,
                     "cudssDataGet for CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN");
printf("cuDSS example: minimum amount of device memory\n"
       "for the hybrid memory mode is %ld bytes\n",
       device_memory_min);
/* (optional) The user can specify how much device memory is available for
   cuDSS.
   Note: By default, cuDSS will try to use all available
   device memory if needed */
int64_t hybrid_device_memory_limit = 40 * 1024; // in bytes = 40 KB
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                     CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                     &hybrid_device_memory_limit,
                     sizeof(hybrid_device_memory_limit)),
                     status,
                     "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");
printf("cuDSS example: set the upper limit on device memory\n"
       "for the hybrid memory mode to %ld bytes\n",
       hybrid_device_memory_limit);
/* Factorization */
...
/* Solving */
...
The first cuDSS call, cudssConfigSet(), enables hybrid memory mode before the first analysis step (symbolic factorization). This is followed by a cudssDataGet() call to find the minimal amount of device memory sufficient for hybrid memory mode. A second cudssConfigSet() call then specifies the amount of device memory cuDSS may use. Note that the automatic memory management can occasionally result in out-of-memory (OOM) errors.
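As a small sketch layered on the example above (not part of it; the 10% headroom is an arbitrary assumption), you could choose a limit based on the free device memory reported by cudaMemGetInfo after the analysis phase, while never going below the queried minimum:
size_t free_bytes = 0, total_bytes = 0;
cudaMemGetInfo(&free_bytes, &total_bytes);       /* free device memory after analysis */
int64_t limit = (int64_t)(0.9 * free_bytes);     /* keep ~10% headroom (arbitrary)    */
if (limit < device_memory_min)                   /* never go below the queried minimum */
    limit = device_memory_min;
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                     CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                     &limit, sizeof(limit)), status,
                     "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");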
For developers integrating this, the debugging tips in the documentation are gold; save yourself some headaches by giving them a read.
Hybrid memory mode performance depends on the CPU/GPU memory bandwidth available to move data between the CPU and the GPU. To illustrate this, Figure 1 below shows the factorization and solve speed-up for matrices ranging from 1 million to 18 million rows that are solved using cuDSS's hybrid memory mode. The baseline is a single NVIDIA B200 GPU node. The observed speed-up compares the same model executed on a Grace Blackwell node to an x86 Blackwell node, reflecting the memory bandwidth ratio between the two nodes.


With the INT64 and hybrid memory mode strategies, cuDSS can accommodate large problem sizes and use all the memory available on the node if needed. But we are still limited to a single GPU. The next strategy allows us to use more GPUs to accommodate larger problems, and also to solve problems of a fixed size faster by using more GPUs.
cuDSS multi-GPU mode (MG mode) allows the developer to use all of the GPUs in a single node without having to specify any distributed communication layer; cuDSS handles all of the communication between GPUs internally. It is helpful in three scenarios:
- When the problem is too large to fit on a single device (with or without hybrid memory).
- When the user wants to avoid the performance penalty of hybrid memory mode.
- When the user is focused on strong scaling: solving the problem across more GPUs to reach a solution faster.
The highlight of MG mode is that the developer doesn't have to specify a communication layer: no MPI, no NCCL, and no other communication library is needed. cuDSS does all of this for you.
Moreover, due to CUDA's limitations with CUDA-aware MPI communication on Windows nodes, MG mode is particularly helpful for applications running on Windows.
Figure 2 below illustrates the time (in seconds) required to solve a roughly 30-million-row matrix on an NVIDIA DGX H200 node across one-, two-, and four-GPU configurations, with the factorization time in the top chart and the solve time in the bottom chart. The initial computation was performed on a single GPU, followed by runs using two and four GPUs with MG mode. As shown, solving the model with two GPUs significantly reduces computation time compared to a single GPU, albeit at the cost of increased GPU resource usage.


This example shows you how to use MG mode. The relevant parts of the code are summarized below. Note that it also includes code for hybrid memory mode. This matters because when you use hybrid memory, you have to set the device memory limits on all of the devices that will be used.
...
/* Creating the cuDSS library handle */
cudssHandle_t handle;
/* Query the actual number of available devices */
int device_count = 0;
cuda_error = cudaGetDeviceCount(&device_count);
if (cuda_error != cudaSuccess || device_count <= 0) {
    printf("ERROR: no GPU devices found\n");
    fflush(0);
    return -1;
}
/* device_indices can be set to NULL. In that case cuDSS will take devices
 * from 0 to (device_count - 1)
 */
int *device_indices = NULL;
device_indices = (int *)malloc(device_count * sizeof(int));
if (device_indices == NULL) {
    printf("ERROR: failed to allocate host memory\n");
    fflush(0);
    return -1;
}
for (int i = 0; i < device_count; i++)
    device_indices[i] = i;
...
/* Initialize the cuDSS handle for multiple devices */
CUDSS_CALL_AND_CHECK(cudssCreateMg(&handle, device_count, device_indices),
                     status, "cudssCreate");
...
/* Creating cuDSS solver configuration and data objects */
cudssConfig_t solverConfig;
cudssData_t solverData;
CUDSS_CALL_AND_CHECK(cudssConfigCreate(&solverConfig), status,
                     "cudssConfigCreate");
/* Pass the same device_count and device_indices to solverConfig */
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                     CUDSS_CONFIG_DEVICE_COUNT, &device_count,
                     sizeof(device_count)), status,
                     "cudssConfigSet for device_count");
CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                     CUDSS_CONFIG_DEVICE_INDICES, device_indices,
                     device_count * sizeof(int)), status,
                     "cudssConfigSet for device_indices");
CUDSS_CALL_AND_CHECK(cudssDataCreate(handle, &solverData), status,
                     "cudssDataCreate");
...
/* Symbolic factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_ANALYSIS,
                     solverConfig, solverData, A, x, b),
                     status, "cudssExecute for analysis");
...
/* Querying CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN must be done for each device
 * individually by calling cudaSetDevice() prior to cudssDataGet.
 * The same applies to getting CUDSS_DATA_MEMORY_ESTIMATES and to setting
 * CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT with cudssConfigSet().
 */
int default_device = 0;
cudaGetDevice(&default_device);
for (int dev_id = 0; dev_id < device_count; dev_id++) {
    cudaSetDevice(device_indices[dev_id]);
    int64_t hybrid_device_memory_limit = 0;
    size_t sizeWritten;
    CUDSS_CALL_AND_CHECK(cudssDataGet(handle, solverData,
                         CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                         &hybrid_device_memory_limit,
                         sizeof(hybrid_device_memory_limit),
                         &sizeWritten),
                         status, "cudssDataGet for the memory estimates");
    printf("dev_id = %d CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN %ld bytes\n",
           device_indices[dev_id], hybrid_device_memory_limit);
}
/* cuDSS requires all API calls to be made on the default device, so
 * resetting the device context.
 */
cudaSetDevice(default_device);
/* Factorization */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_FACTORIZATION,
                     solverConfig, solverData, A, x, b),
                     status, "cudssExecute for factorization");
/* Solving */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE, solverConfig,
                     solverData, A, x, b), status,
                     "cudssExecute for solve");
...
Setting up MG mode is straightforward. It begins by finding the number of devices on the node and using all of them, or the specific number of devices you want. The device indices are then set for that number of devices starting with device 0 (the code will use the first device_count devices, which you can change to specific device numbers if you want). You could easily have the number of devices and the device index list provided on the command line or via a file to make your code more flexible, as in the sketch below.
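A small hedged sketch of that idea (the command-line format is hypothetical, not from the sample) could look like this:
/* Hypothetical usage: ./solver <num_devices> <dev0> <dev1> ...
 * Falls back to all visible GPUs when no arguments are given. */
int device_count = 0;
int *device_indices = NULL;
if (argc > 1) {
    device_count = atoi(argv[1]);
    device_indices = (int *)malloc(device_count * sizeof(int));
    for (int i = 0; i < device_count; i++)
        device_indices[i] = (argc > 2 + i) ? atoi(argv[2 + i]) : i;
} else {
    cudaGetDeviceCount(&device_count);
    device_indices = (int *)malloc(device_count * sizeof(int));
    for (int i = 0; i < device_count; i++)
        device_indices[i] = i;
}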
After this, the MG-specific coding begins by calling cudssCreateMg() to initialize the cuDSS handle for multiple devices. But before you call any solution phase, you additionally have to initialize the cuDSS configuration with the device information. Specifically, after creating a cuDSS solver configuration object using cudssConfigCreate(), you must set the configuration details for MG mode using cudssConfigSet() for the following:
- CUDSS_CONFIG_DEVICE_COUNT, using device_count.
- CUDSS_CONFIG_DEVICE_INDICES, using the array device_indices.
Then you use cudssDataCreate() to create the solverData object for cuDSS and perform the analysis stage next.
If you're using hybrid memory mode, you need to set the device memory limits for each of the devices individually prior to the factorization, as shown in the code above. Once that's done, you can factorize the matrix and solve the problem.
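If you want an explicit per-device cap rather than the automatic default, a sketch that mirrors the query loop above but sets the limit might look like this (the 8 GB value is an arbitrary assumption, not from the sample):
for (int dev_id = 0; dev_id < device_count; dev_id++) {
    cudaSetDevice(device_indices[dev_id]);
    int64_t per_device_limit = 8LL * 1024 * 1024 * 1024;   /* arbitrary 8 GB cap */
    CUDSS_CALL_AND_CHECK(cudssConfigSet(solverConfig,
                         CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                         &per_device_limit, sizeof(per_device_limit)), status,
                         "cudssConfigSet for CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT");
}
cudaSetDevice(default_device);   /* all further cuDSS calls go to the default device */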
A highlight of MG mode is that you don't have to code the communication between the GPUs; cuDSS does all of that for you. However, there are some current limitations to using MG mode:
- Using MG mode jointly with multi-GPU multi-node (MGMN) mode is not supported (the next section covers MGMN).
- Distributed input is not currently supported.
- MG mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
- MG mode does not support matrix batches.
- All phases in MG mode are synchronous.
- All data must be on the first device (rank 0) before calling cudssExecute. cuDSS will then distribute the data to the other devices as needed.
Going big: Multi-GPU Multi-Node (MGMN) mode for distributed power
Now, what if one node isn't enough and you want to spread your computations across multiple nodes? That's where MGMN mode comes in. This requires a communication layer that, once added, lets you use any or all of the GPUs in a node as well as multiple nodes, with no limitations. This allows users to solve massive problems, or to use more GPUs to solve a problem faster.
cuDSS uses an abstraction: a small communication "shim" layer that can be tailored to CUDA-aware Open MPI, NVIDIA NCCL, or even a custom backend if you want.
This example for MGMN has code for both Open MPI and NCCL. If you wish to use your own communication layer, there is an explanation of how to do that.
To illustrate how the communication layer is used, an ifdef block from the example is shown below for both the MPI and NCCL code paths. Some constants defined during compilation are essential for this example but aren't shown in the code block: USE_MPI and USE_NCCL define which code paths are to be used.
This ifdef code block corresponds to lines 520-535 in the sample code (these line numbers may change in subsequent versions, so check them carefully).
#ifdef USE_MPI
#if USE_OPENMPI
    if (strcmp(comm_backend_name, "openmpi") == 0) {
        CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                             mpi_comm, sizeof(MPI_Comm*)),
                             status,
                             "cudssDataSet for OpenMPI comm");
    }
#endif
#if USE_NCCL
    if (strcmp(comm_backend_name, "nccl") == 0) {
        CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                             nccl_comm, sizeof(ncclComm_t*)),
                             status,
                             "cudssDataSet for NCCL comm");
    }
#endif
#endif
Note that the code changes for selecting MPI or NCCL are minimal, and the differences between the two are simple. You can use your own communication layer in a very similar manner.
Once you define the communicator pointer and pass it to cuDSS via CUDSS_DATA_COMM, as shown in the previous code snippet, there is no need to call any other communication layer function unless your code specifically requires it. cuDSS uses the defined communication layer under the covers so you don't have to code for it. Look through the example code to see how more than one node is used.
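To give a sense of how little host-side communication code is involved, here is a rough sketch (assumed variable names, one MPI rank per GPU; the comm-layer shim library selection shown in the sample is omitted here) of the MPI boilerplate surrounding the cudssDataSet() call from the snippet above:
#include <mpi.h>
/* ... */
MPI_Init(&argc, &argv);
int rank = 0, nranks = 1;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);

/* Bind this rank to one GPU on its node (assumes one rank per GPU) */
int local_device_count = 0;
cudaGetDeviceCount(&local_device_count);
cudaSetDevice(rank % local_device_count);

/* Hand the communicator to cuDSS; after this, cuDSS performs all
 * inter-GPU and inter-node communication internally */
MPI_Comm *mpi_comm = (MPI_Comm *)malloc(sizeof(MPI_Comm));
*mpi_comm = MPI_COMM_WORLD;
CUDSS_CALL_AND_CHECK(cudssDataSet(handle, solverData, CUDSS_DATA_COMM,
                     mpi_comm, sizeof(MPI_Comm*)),
                     status, "cudssDataSet for OpenMPI comm");

/* ... analysis, factorization, and solve as usual ... */
MPI_Finalize();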
For implementing your own communication layer, a good introductory discussion can be found in the cuDSS documentation under advanced topics.
A high-level overview of the communication layer requirements is below:
- MGMN mode is enabled by abstracting away all communication-specific primitives into a small, separately built shim communication layer.
- Users can provide their own implementation of the communication layer with the communication backend of their choice (MPI, NCCL, etc.).
- MGMN support in cuDSS doesn't require any changes for applications that don't use MGMN mode.
- MGMN mode supports 1D row-wise distribution (with overlap) for the input CSR matrix, dense right-hand side, or solution, using the cudssMatrixSetDistributedRow1d() function (see the following paragraph).
cuDSS MGMN mode optionally accepts pre-distributed input and can optionally create distributed output. You can have both A and b on the rank 0 device, in which case cuDSS will distribute them, or you can tell cuDSS how the data is distributed across the devices and nodes with the cudssMatrixSetDistributedRow1d() function. In the latter case, the developer has to make sure the data is in the correct location on the correct node and device.
A critical step for good performance is to carefully select your CPU:GPU:NIC bindings. This is not discussed here but is documented elsewhere.
There are some current limitations to MGMN mode:
- MGMN mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.
- MGMN mode does not support matrix batches.
- All phases in MGMN mode are synchronous.
Takeaways
Sparse linear systems appear in many disciplines. Fueled by the need to solve real-life problems, the overall size of these systems is growing at a rapid rate, and developers must find ways to solve them both efficiently and quickly. NVIDIA cuDSS provides an easy-to-use library for solving increasingly massive problems using NVIDIA GPUs.
For more features you can use with cuDSS, it is recommended that you read through the advanced features section of the documentation. It contains more information about the features presented here, as well as other capabilities to help you solve your large sparse linear problems. There is also a section that explains how to enable logging with cuDSS as you develop your code. This is an important resource, since debugging parallel code can be difficult, and cuDSS has strong capabilities for emitting log information as your code executes.
Subscribe to cuDSS on the product page to stay updated on the most recent innovations.
