NVIDIA CUDA developers have access to a wide selection of tools and libraries that simplify development and deployment, enabling users to focus on the “what” and the “how” of their applications.
One example is Multi-Process Service (MPS), which lets users improve GPU utilization by sharing GPU resources across processes. Importantly, this happens transparently: applications don’t need to be aware of MPS, and no code modifications are needed.
Introducing MLOPart
NVIDIA Blackwell GPUs deliver high bandwidth that’s well-suited to training today’s large language models. However, there are cases where applications don’t benefit from the full bandwidth of Blackwell and are more latency sensitive.
Memory Locality Optimized Partition (MLOPart) devices are NVIDIA CUDA devices derived from a GPU and optimized for lower latency. MLOPart is a CUDA MPS feature that allows multi-GPU-aware applications to see MLOPart devices.
In the real world, it’s not always easy to determine whether an application is latency-bound or bandwidth-bound. MLOPart is designed to be enabled and disabled using the MPS controller and doesn’t require an application to be rewritten. Developers can do simple A/B testing to see whether an application benefits from MLOPart.
MLOPart device enumeration
The defining aspect of MLOPart is that when it’s enabled, MLOPart-capable devices appear as multiple distinct CUDA devices, each with its own compute and memory resources. In this sense, it is similar to NVIDIA Multi-Instance GPU (MIG). We’ll compare MLOPart with MIG later in this post.
MLOPart creates CUDA devices based on the underlying architecture of the GPU. Where possible, CUDA devices are split along boundaries across which memory accesses would incur higher latency, with each side of the boundary contributing the memory and compute resources of one MLOPart device. For Blackwell, the split is along the die boundaries.
If a GPU doesn’t have such boundaries, no MLOPart devices are created, and the GPU is presented to CUDA applications as usual. NVIDIA DGX B200 and NVIDIA B300 are capable of two MLOPart devices per GPU. This number may change with future architectures, so it’s recommended that developers don’t hardcode assumptions about the number of MLOPart devices a GPU will support.
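For instance, here’s a minimal sketch (standard CUDA runtime API; error handling omitted for brevity) that queries the device count and basic properties at run time instead of hardcoding how many devices a GPU is split into:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    // Query how many CUDA devices this process can see. Under an
    // MLOPart-enabled MPS server, this count includes MLOPart devices.
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int i = 0; i < numDevices; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}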
MLOPart device capabilities and characteristics
An MLOPart device shares many similarities with the underlying device, with a few notable exceptions. While in principle developers don’t have to rewrite applications to use MLOPart devices, they should be aware that MLOPart devices don’t share all of the capabilities and characteristics of the underlying devices.
Capabilities and characteristics shared with the underlying device include:
Compute capability
An MLOPart device has the same compute capability and can execute the same GPU binaries as the underlying device. For example, a device that supports MLOPart with compute capability 10.0 will have MLOPart devices that also have compute capability 10.0.
Peer-to-peer ability
An MLOPart device is capable of the same peer-to-peer communication as the underlying device. For example, if two physical devices are connected by NVIDIA NVLink, any MLOPart devices derived from those two underlying devices will also be connected by NVLink.
The exception to this rule is between MLOPart devices belonging to the same underlying device. In this case, they’re still capable of peer-to-peer communication, but don’t require a peer-to-peer interconnect such as NVLink or PCIe.
When peer devices are MLOPart devices belonging to the same underlying device, they’re expected to have lower latency and higher peer-to-peer bandwidth than peer devices connected through other means.
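The following is a small sketch, using standard runtime calls (nothing MLOPart-specific is assumed), of how an application might check and enable peer access between two device ordinals, whether they’re full GPUs or MLOPart devices:

#include <stdio.h>
#include <cuda_runtime.h>

// Enable peer access from device `src` to device `dst` if supported.
// The same code path applies whether the two ordinals are full GPUs or
// MLOPart devices (possibly derived from the same underlying GPU).
void enablePeerAccessIfPossible(int src, int dst) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, src, dst);
    if (canAccess) {
        cudaSetDevice(src);
        cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
        if (err == cudaSuccess || err == cudaErrorPeerAccessAlreadyEnabled) {
            printf("Peer access available: device %d -> device %d\n", src, dst);
        }
    }
}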
PCI IDs
MLOPart devices share the same PCI ID (bus.device.domain) as the underlying device.
Capabilities and characteristics that differ from the underlying device include the following.
Streaming multiprocessor count
Each MLOPart device will have fewer streaming multiprocessors (SMs) than the underlying device. Moreover, the total number of SMs across all MLOPart devices sharing a common underlying device may be fewer than the total number of SMs in the underlying device.
MLOPart devices belonging to the same underlying device have the same number of SMs, and the number of SMs is consistent across GPUs of the same model.
For example, an NVIDIA HGX B200 system with 8 Blackwell GPUs that normally have 148 SMs each will present 16 MLOPart devices with 70 SMs each when MLOPart is enabled.
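A sketch of how an application could query the SM count per device at run time, rather than assuming the full GPU’s SM count when sizing a launch:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int i = 0; i < numDevices; i++) {
        int smCount = 0;
        // For an MLOPart device this reports the SMs assigned to that
        // partition, which may be fewer than the underlying GPU's total.
        cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, i);
        printf("Device %d: %d SMs\n", i, smCount);
    }
    return 0;
}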
Available memory
MLOPart devices have a partition of the total memory of the underlying device, and only allocate from that partition, except in the case of CUDA managed memory allocations. Each MLOPart device will have less memory than the underlying device. MLOPart devices belonging to the same underlying device have the same total memory.
In the current version of MLOPart, it’s possible for memory allocated on one MLOPart device to affect the available memory reported by cuMemGetInfo and cudaMemGetInfo on another MLOPart device from the same underlying device, even though they have separate partitions. Future drivers will enforce more rigid memory partitioning between MLOPart devices.
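As a hedged sketch, per-device free and total memory can be inspected with cudaMemGetInfo; under MLOPart the reported values are expected to describe the device’s memory partition, subject to the current-driver caveat above:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int i = 0; i < numDevices; i++) {
        cudaSetDevice(i);
        size_t freeBytes = 0, totalBytes = 0;
        // For an MLOPart device, these values describe its memory partition.
        // Note that allocations on a sibling MLOPart device may currently
        // influence the free value, as described above.
        cudaMemGetInfo(&freeBytes, &totalBytes);
        printf("Device %d: %zu MiB free / %zu MiB total\n",
               i, freeBytes >> 20, totalBytes >> 20);
    }
    return 0;
}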
Virtual address space
MLOPart devices on the same underlying device share a virtual address space. This means that a buffer overrun of memory allocated on one MLOPart device can corrupt memory allocated on another MLOPart device within the same process.
Universally unique identifier
Each MLOPart device has its own universally unique identifier (UUID) that can be queried through CUDA APIs. This can be used to uniquely identify MLOPart devices and to filter the available CUDA devices using CUDA_VISIBLE_DEVICES.
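For example, here’s a short sketch that prints each device’s UUID in the GPU-xxxx… form used by CUDA_VISIBLE_DEVICES (the byte-to-string formatting is our own; only the raw 16-byte UUID comes from the CUDA API):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int i = 0; i < numDevices; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.uuid holds 16 raw bytes; print them in the familiar
        // GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx form.
        printf("Device %d: GPU-", i);
        for (int b = 0; b < 16; b++) {
            printf("%02x", (unsigned char)prop.uuid.bytes[b]);
            if (b == 3 || b == 5 || b == 7 || b == 9) printf("-");
        }
        printf("\n");
    }
    return 0;
}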
Deploying with MLOPart
As with other CUDA MPS features, users can control behavior through MPS controller commands.
The start_server command starts an MPS server. In CUDA 13.1, we introduced the -mlopart option to this command, which lets users start an MPS server that creates MLOPart-enabled MPS clients. Because this is done on a per-server basis, multiple users can have different MLOPart configurations, depending on their needs.
In CUDA 13.0, we introduced the device_query MPS controller command to provide information about the CUDA devices enumerated by MPS. After a server has been created, device_query can be used to determine information about the devices that will be exposed to clients of that server, such as the device name, device ordinals, and UUIDs.
$ echo device_query | nvidia-cuda-mps-control
Default
Device Ordinal PCI IDs UUID Name Attributes
0 0000:1b.00.00 GPU-ebebf640-14d4-de34-f16e-a5e7da272ac4 NVIDIA B200
1 0000:43.00.00 GPU-6d3a75da-dd2e-173e-e797-c0b8ed47a100 NVIDIA B200
2 0000:52.00.00 GPU-a517c26e-0f2f-945a-1672-ea75149f54d6 NVIDIA B200
3 0000:61.00.00 GPU-999b1bd5-82d8-3db2-e2ec-fdae5d1103b1 NVIDIA B200
4 0000:9d.00.00 GPU-b5830513-614b-38ac-b177-5cc2f850ea3d NVIDIA B200
5 0000:c3.00.00 GPU-05f3779e-bfa6-f9c8-256f-6cee98b8871d NVIDIA B200
6 0000:d1.00.00 GPU-2facdb95-1af2-26e3-2c9d-e02f4651675d NVIDIA B200
7 0000:df.00.00 GPU-7e555b40-ffe0-e066-4db3-4ddd96344f0d NVIDIA B200
Server 14056
Device Ordinal PCI IDs UUID Name Attributes
N/A 0000:1b.00.00 GPU-ebebf640-14d4-de34-f16e-a5e7da272ac4 NVIDIA B200 M
0 0000:1b.00.00 GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0 NVIDIA B200 MLOPart 0 MD
1 0000:1b.00.00 GPU-58e7f54c-f60f-56b7-a4c4-b3fb418fde3e NVIDIA B200 MLOPart 1 MD
N/A 0000:43.00.00 GPU-6d3a75da-dd2e-173e-e797-c0b8ed47a100 NVIDIA B200 M
2 0000:43.00.00 GPU-68fb01e9-499c-56d4-b768-8fca70a5ddff NVIDIA B200 MLOPart 0 MD
3 0000:43.00.00 GPU-6cf0c4ea-3a05-52b1-aec6-63acf60df19b NVIDIA B200 MLOPart 1 MD
N/A 0000:52.00.00 GPU-a517c26e-0f2f-945a-1672-ea75149f54d6 NVIDIA B200 M
4 0000:52.00.00 GPU-dd670b14-ca31-5dfd-a49b-7220701f4fc6 NVIDIA B200 MLOPart 0 MD
5 0000:52.00.00 GPU-d7433996-1714-5baa-9812-22cecdc792d3 NVIDIA B200 MLOPart 1 MD
N/A 0000:61.00.00 GPU-999b1bd5-82d8-3db2-e2ec-fdae5d1103b1 NVIDIA B200 M
6 0000:61.00.00 GPU-cff5ab0b-a509-54c8-a9c0-c5ebe3fbd3a0 NVIDIA B200 MLOPart 0 MD
7 0000:61.00.00 GPU-7933cfe7-5139-50d8-ad90-0f7f1ddba559 NVIDIA B200 MLOPart 1 MD
N/A 0000:9d.00.00 GPU-b5830513-614b-38ac-b177-5cc2f850ea3d NVIDIA B200 M
8 0000:9d.00.00 GPU-f973284b-7385-576b-80d7-3ea083bcea94 NVIDIA B200 MLOPart 0 MD
9 0000:9d.00.00 GPU-668e4145-b221-5495-a3fe-a5cdc0e6f6eb NVIDIA B200 MLOPart 1 MD
N/A 0000:c3.00.00 GPU-05f3779e-bfa6-f9c8-256f-6cee98b8871d NVIDIA B200 M
10 0000:c3.00.00 GPU-53858feb-87eb-5963-8d47-6fbf4b24cd4a NVIDIA B200 MLOPart 0 MD
11 0000:c3.00.00 GPU-700b029a-be98-5d13-9a4e-5e8e21386e34 NVIDIA B200 MLOPart 1 MD
N/A 0000:d1.00.00 GPU-2facdb95-1af2-26e3-2c9d-e02f4651675d NVIDIA B200 M
12 0000:d1.00.00 GPU-563db4f2-f70a-564d-aa4a-dbd52d6dfc0b NVIDIA B200 MLOPart 0 MD
13 0000:d1.00.00 GPU-b643e07a-6eda-5cd8-bdde-1788590d0b4b NVIDIA B200 MLOPart 1 MD
N/A 0000:df.00.00 GPU-7e555b40-ffe0-e066-4db3-4ddd96344f0d NVIDIA B200 M
14 0000:df.00.00 GPU-f8f5b46d-7774-57a1-97d2-88f23c3457f0 NVIDIA B200 MLOPart 0 MD
15 0000:df.00.00 GPU-46d7f9b7-0303-5432-b50a-16381f37e365 NVIDIA B200 MLOPart 1 MD
When MLOPart is enabled, device_query shows the MLOPart devices below the device from which they’re derived. This is the recommended method for determining the UUID values to use with CUDA_VISIBLE_DEVICES when launching an application; because CUDA enumerates more devices than physically exist on the system, device ordinals alone are ambiguous.
Note that MLOPart devices only exist within the context of MPS and CUDA. nvidia-smi doesn’t provide information about MLOPart devices.
Lastly, the ps MPS controller command has been extended to display whether a process is using an MLOPart device.
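For example, to restrict a client application to a single MLOPart device, pass one of the UUIDs from the server section of the device_query output above via CUDA_VISIBLE_DEVICES (./my_app is a placeholder for your application):

# Launch a client on one MLOPart device of the first B200 (UUID taken from device_query)
$ CUDA_VISIBLE_DEVICES=GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0 ./my_app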
$ while1 -a &
[1] 52845
$ echo ps | nvidia-cuda-mps-control
PID ID SERVER DEVICE NAMESPACE COMMAND ATTRIBUTES
52845 1 52837 GPU-b13add01-c28c 4026531836 while1 MD
MLOPart in use
Now let’s look at how MLOPart can affect memory latency and bandwidth.
Latency
As an example, let’s look at how MLOPart affects memory latency using a simple kernel that performs atomic operations in a loop.
First, we define the kernel and a helper:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Helper macro to check for CUDA errors
#define CUDA_CHECK_FAILURE(x)                                              \
    if (cudaSuccess != (cudaError_t)x)                                     \
    {                                                                      \
        const char* errName = cudaGetErrorName(x);                         \
        const char* errStr = cudaGetErrorString(x);                        \
        printf("%s:%d - %s: %s\n", __FILE__, __LINE__, errName, errStr);   \
        exit(EXIT_FAILURE);                                                \
    }

// Device memory variable used to prevent the compiler from optimizing away the memory access
__device__ volatile int dummy;

// Trivial kernel to touch the memory so we can measure latency
__global__ void accessMemoryHighLatency(int *startAddress, size_t memorySizeInBytes) {
    for (int i = 0; i < memorySizeInBytes / sizeof(int); ++i) {
        dummy = atomicAdd(&startAddress[i], 1);
    }
}
Atomic operations are latency-sensitive, making it easy to measure the difference between using and not using MLOPart. The following function uses CUDA events to measure the runtime of the accessMemoryHighLatency kernel.
// Function to launch the kernel and measure the runtime using CUDA events
float measureKernelRuntime(int *memoryDevPtr, size_t memorySizeInBytes, int numBlocks, int numThreads) {
    cudaEvent_t start = NULL, stop = NULL;
    float time = 0;
    CUDA_CHECK_FAILURE(cudaEventCreate(&start));
    CUDA_CHECK_FAILURE(cudaEventCreate(&stop));
    CUDA_CHECK_FAILURE(cudaEventRecord(start, 0));
    accessMemoryHighLatency<<<numBlocks, numThreads>>>(memoryDevPtr, memorySizeInBytes);
    CUDA_CHECK_FAILURE(cudaPeekAtLastError());
    CUDA_CHECK_FAILURE(cudaEventRecord(stop, 0));
    CUDA_CHECK_FAILURE(cudaEventSynchronize(stop));
    CUDA_CHECK_FAILURE(cudaEventElapsedTime(&time, start, stop));
    CUDA_CHECK_FAILURE(cudaEventDestroy(start));
    CUDA_CHECK_FAILURE(cudaEventDestroy(stop));
    return time;
}
Finally, we can put this all together in a simple multi-GPU-aware program.
int main(int argc, char *argv[]) {
    size_t memorySizeInBytes = 32 * 1024 * 1024; // 32 MB
    int numBlocks = 32;
    int numThreads = 1;
    int numDevices = 0;
    float totalTime = 0;
    CUDA_CHECK_FAILURE(cudaGetDeviceCount(&numDevices));
    // Measure the runtime for each device
    for (int i = 0; i < numDevices; i++) {
        // Set the current device
        CUDA_CHECK_FAILURE(cudaSetDevice(i));
        // Allocate memory on the device
        int *memoryDevPtr;
        CUDA_CHECK_FAILURE(cudaMalloc(&memoryDevPtr, memorySizeInBytes));
        // Measure the runtime
        float time = measureKernelRuntime(memoryDevPtr, memorySizeInBytes, numBlocks, numThreads);
        totalTime += time;
        printf("Device %d - Total time: %f milliseconds\n", i, time);
        // Free the memory
        CUDA_CHECK_FAILURE(cudaFree(memoryDevPtr));
    }
    printf("Average time: %f milliseconds\n", totalTime / numDevices);
    return EXIT_SUCCESS;
}
We’ll name this file atomic_memory_access.cu and compile it using nvcc atomic_memory_access.cu -arch=sm_100 -o atomic_memory_access.
To establish a baseline, let’s run the example using MPS, but without MLOPart.
$ nvidia-cuda-mps-control -d
# Optional step of explicitly creating an MPS server. This also happens implicitly when we launch a CUDA application while MPS is active.
$ echo start_server -uid $UID | nvidia-cuda-mps-control
$ ./atomic_memory_access
Device 0 - Total time: 2320.550537 milliseconds
Device 1 - Total time: 2323.710938 milliseconds
Device 2 - Total time: 2334.533447 milliseconds
Device 3 - Total time: 2304.551025 milliseconds
Device 4 - Total time: 2304.328125 milliseconds
Device 5 - Total time: 2316.102295 milliseconds
Device 6 - Total time: 2306.165283 milliseconds
Device 7 - Total time: 2306.362061 milliseconds
Average time: 2314.537842 milliseconds
Here we see an average time of around 2,300 milliseconds for each device. Now let’s enable MLOPart and run it again.
# Quit the MPS controller to clean up the previous server.
$ echo quit | nvidia-cuda-mps-control
# Now repeat the above steps, with MLOPart enabled.
$ nvidia-cuda-mps-control -d
# Note that we must explicitly start the server with "-mlopart".
$ echo start_server -uid $UID -mlopart | nvidia-cuda-mps-control
$ ./atomic_memory_access
Device 0 - Total time: 1500.194946 milliseconds
Device 1 - Total time: 1475.914062 milliseconds
Device 2 - Total time: 1479.729492 milliseconds
Device 3 - Total time: 1480.196045 milliseconds
Device 4 - Total time: 1478.959106 milliseconds
Device 5 - Total time: 1490.808716 milliseconds
Device 6 - Total time: 1468.943237 milliseconds
Device 7 - Total time: 1479.297241 milliseconds
Device 8 - Total time: 1467.947632 milliseconds
Device 9 - Total time: 1476.900757 milliseconds
Device 10 - Total time: 1477.081421 milliseconds
Device 11 - Total time: 1490.295044 milliseconds
Device 12 - Total time: 1484.558594 milliseconds
Device 13 - Total time: 1481.660156 milliseconds
Device 14 - Total time: 1476.067383 milliseconds
Device 15 - Total time: 1484.143921 milliseconds
Average time: 1480.793457 milliseconds
In this example, we see a significant improvement in execution time per device when using MLOPart. While this was a contrived example, it’s important to test running with and without MLOPart when deciding how to deploy a particular application.
Bandwidth
Given that MLOPart devices have less memory than a full device, they also have lower DRAM bandwidth than devices not using MLOPart.
MLOPart devices on the same underlying GPU have higher peer-to-peer bandwidth between one another than devices that must communicate over NVLink or PCIe.
Let’s look at the (partial) results of a bidirectional P2P bandwidth test between MLOPart devices on the same underlying device and MLOPart devices on different underlying devices:
$ ./nvbandwidth -t device_to_device_memcpy_read_ce
...
Running device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4
0 N/A 2352.76 766.82 743.46 767.51
1 2402.78 N/A 765.86 744.04 767.03
2 767.23 744.30 N/A 2349.54 766.00
3 767.37 743.91 2372.91 N/A 767.30
4 766.75 743.52 766.89 743.97 N/A
In the above example, devices 0 and 1 are on the same underlying GPU, and devices 2 and 3 are on the same underlying GPU.
In the case of B200, peers normally use NVLink when initiating an operation such as cuMemcpyAsync. If the B200 peers are MLOPart devices on the same B200 chip, they instead use the much faster NV-HBI.
Considerations when using MLOPart
As mentioned previously, using MLOPart means choosing lower latency over higher bandwidth. This isn’t the only tradeoff that must be evaluated when using MLOPart.
Device filtering through CUDA_VISIBLE_DEVICES
The devices available to MPS servers and clients can be filtered and/or remapped using the CUDA_VISIBLE_DEVICES environment variable. Typically, this is done using device ordinals. With MPS, this can cause errors if the same CUDA_VISIBLE_DEVICES value is used for both the controller and the server/clients without taking remapping into account.
For example, given a system with 8 CUDA devices, the MPS controller can be initialized to filter out the odd-numbered devices (CUDA_VISIBLE_DEVICES=0,2,4,6). In this scenario, the MPS server and clients will only see at most four CUDA devices, even without using CUDA_VISIBLE_DEVICES. Using the same value for CUDA_VISIBLE_DEVICES will fail, since only devices 0-3 are visible to them. As a result, it’s recommended to use UUIDs, which are unambiguous.
When MLOPart is enabled, there’s an additional inconsistency to be aware of: the UUIDs of the devices visible to the MPS controller differ from those visible to an MPS server/client with MLOPart enabled. When using CUDA_VISIBLE_DEVICES, it’s recommended to execute the device_query command after the MPS server with MLOPart has been started to determine the UUIDs that will be available to MPS clients.
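The following illustrative sketch shows both points; ./my_app is a placeholder, and the UUID is one of the MLOPart client UUIDs from the device_query output earlier in this post:

# Controller filtered to the even-numbered devices by ordinal
$ CUDA_VISIBLE_DEVICES=0,2,4,6 nvidia-cuda-mps-control -d
$ echo start_server -uid $UID -mlopart | nvidia-cuda-mps-control

# The server and its clients enumerate their own (remapped) set of devices, so
# reusing the controller's ordinal list for a client doesn't select the devices
# you intended. Query the UUIDs the clients will actually see, and filter by UUID:
$ echo device_query | nvidia-cuda-mps-control
$ CUDA_VISIBLE_DEVICES=GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0 ./my_app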
Fewer compute resources
When MLOPart is enabled, the MLOPart devices will have some SMs disabled. There’s a tradeoff between the performance gains from reduced memory latency and the performance losses from fewer compute resources. These should be weighed on a per-application basis.
Managed memory
Managed memory doesn’t benefit from MLOPart. As MLOPart requires creating GPU memory for low-latency allocations, this can’t be done with managed memory. Attempting to use managed memory will work as it normally does, and allocations can still be created using managed memory APIs, but they aren’t expected to see performance benefits.
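As a minimal illustration (error handling omitted), a managed allocation under an MLOPart-enabled server behaves as it normally would; it just isn’t expected to show the MLOPart latency benefit:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int *data = NULL;
    size_t bytes = 1 << 20;
    // Managed allocations still succeed and migrate as usual under an
    // MLOPart-enabled MPS server; they simply aren't placed in an MLOPart
    // device's low-latency partition, so no latency benefit is expected.
    cudaMallocManaged(&data, bytes);
    cudaMemset(data, 0, bytes);   // usable from host and device as normal
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}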
Access modifiers
The cuMemSetAccess API enables programmers to specify access properties for CUDA allocations. When this API is used with MLOPart devices, the least restrictive property set on any MLOPart device belonging to the same underlying GPU is applied to all of them. For example, setting a buffer as read-only for one MLOPart device and read-write (the default) for another MLOPart device results in both MLOPart devices having read-write access, until both are updated to a more restrictive access type.
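Below is a hedged sketch, using the CUDA driver virtual memory management APIs (error checking omitted; link with -lcuda), of setting different access flags for two device ordinals on one allocation. If ordinals 0 and 1 are MLOPart devices of the same GPU, the less restrictive read-write setting effectively applies to both:

#include <cuda.h>

int main() {
    // Establish a context on device 0 using the primary context.
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxSetCurrent(ctx);

    // Create a physical allocation on device 0 and map it into a reserved VA range.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = 0;
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    CUdeviceptr ptr;
    CUmemGenericAllocationHandle handle;
    cuMemAddressReserve(&ptr, granularity, 0, 0, 0);
    cuMemCreate(&handle, granularity, &prop, 0);
    cuMemMap(ptr, granularity, 0, handle, 0);

    // Request read-only access for device 0 and read-write for device 1.
    // If both ordinals are MLOPart devices of the same underlying GPU,
    // the less restrictive setting (read-write) applies to both.
    CUmemAccessDesc access[2] = {};
    access[0].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[0].location.id = 0;
    access[0].flags = CU_MEM_ACCESS_FLAGS_PROT_READ;
    access[1].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[1].location.id = 1;
    access[1].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, granularity, access, 2);

    // Cleanup.
    cuMemUnmap(ptr, granularity);
    cuMemRelease(handle);
    cuMemAddressFree(ptr, granularity);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}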
x86 requirement
MLOPart is currently only supported on x86 platforms. Support for ARM platforms will be available in a future release.
Comparison to MIG
MIG can be used to create multiple CUDA devices from a single GPU, as is done with MLOPart. Certain MIG configurations can also reduce latency at the cost of bandwidth, while requiring no code changes.
| Topic | MIG | MLOPart / MPS |
|---|---|---|
| Privilege required | Requires superuser privilege to configure | Doesn't require superuser privilege |
| Scope | System-wide setting | Per-user / per-server setting |
| Memory isolation | Enforces strict memory isolation between MIG GPU instances | Memory from one MLOPart device may corrupt another on the same GPU |
| Performance isolation | Enforces strict performance isolation between MIG compute instances | Performance interference may occur between MLOPart devices |
To learn more about MLOPart, CUDA MPS, and how to maximize GPU utilization, see the MPS documentation.
Acknowledgements: Thanks to the following NVIDIA contributors: Alfred Barnat, Ehren Bendler, Alicia Hu, Balint Joo, Ze Long, Yashwant Marathe, Vance Miller, Kyrylo Perelygin, Will Pierce, Yifan Yang
