NVIDIA flagship data center GPUs in the NVIDIA Ampere, NVIDIA Hopper, and NVIDIA Blackwell families all exhibit non-uniform memory access (NUMA) behavior, but expose a single memory space. Most programs therefore do not have a problem with memory non-uniformity. However, as bandwidth increases in newer GPU generations, there are significant performance and power gains available when compute and data locality are taken into account.
This post first analyzes the memory hierarchy of NVIDIA GPUs, discussing the power and performance impacts of data transfer over the die-to-die link. It then reviews how to use NVIDIA Multi-Instance GPU (MIG) mode to achieve data localization. Finally, it presents results from running in MIG mode versus unlocalized for the Wilson-Dslash stencil operator use case.
Memory hierarchy in NVIDIA GPUs
Consider the abstract view of the memory hierarchy with two NUMA nodes depicted in Figure 1. When a streaming multiprocessor (SM) on node 0 must access a memory location in the dynamic random-access memory (DRAM) of node 1, it must transfer data over the L2 fabric. In the case of NVIDIA Blackwell GPUs, each NUMA node is a distinct physical die, which adds latency and increases the power required for data transfer. Despite the added complexity, NUMA-unaware code can still achieve peak DRAM bandwidth.


To address these drawbacks, it is beneficial to reduce data transfers between NUMA nodes. When a single memory space is presented to the user, the NVIDIA architecture employs coherent caching in L2 to reduce data transfers between NUMA nodes. This mechanism helps prevent repeated accesses to the same memory address from refetching data over the L2 fabric interface. Ideally, once the address is fetched into the local L2 cache, all subsequent accesses to the same address will hit the cache.
Before the introduction of coherent caching, the unified L2 cache allowed all SMs to achieve peak bandwidth (as in NVIDIA Volta), though latency varied depending on the proximity of the SM to different L2 segments. With the NVIDIA Ampere generation, larger chips introduced a hierarchy of NUMA nodes, each with its own L2 cache and a coherent connection to others.
While large data center GPUs since the NVIDIA Ampere architecture have used this design (unlike smaller gaming GPUs), the L2 fabric connection sustains peak bandwidth, as noted for the NVIDIA Blackwell Ultra architecture.
Two challenges have emerged as GPUs continue to grow: increased latency and power limitations.
- Increased latency: Accessing distant parts of the L2 cache has led to growing latency, which impacts performance, particularly for synchronization.
- Power limitations: On the largest GPUs, power consumption becomes a limiting factor when tensor cores are active. Reducing power consumption through localized L2 access enables decreasing the L2 fabric clock and raising the compute clock through the Dynamic Voltage and Frequency Scaling (DVFS) mechanism associated with GPU Boost. In this way, tensor core performance can be significantly improved (the telemetry commands after this list show one way to observe clocks and power).
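To see this behavior on your own system, clock and power telemetry can be inspected while a workload runs. The following is a minimal sketch using standard nvidia-smi queries, assuming GPU index 0:
$ nvidia-smi -i 0 -q -d CLOCK,POWER   # report the current SM/memory clocks and power draw
$ nvidia-smi dmon -i 0                # stream per-second samples of power, clocks, and utilization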
MIG reduces data transfers between NUMA nodes. Introduced with the NVIDIA Ampere architecture, this feature enables partitioning a single GPU into multiple instances. Through the use of MIG, developers can create one GPU instance per NUMA node, thereby eliminating accesses over the L2 fabric interface.
This approach comes with its own costs, including the overhead of communicating between different GPU instances using PCIe. The next section presents results from running workloads using MIG mode and unlocalized memory to demonstrate the effectiveness of this approach.
Data localization using MIG
MIG enables supported NVIDIA GPUs to be partitioned into multiple isolated instances, each with dedicated high-bandwidth memory, cache, and compute cores. This allows efficient and high-performance GPU utilization across multiple users or workloads. MIG can provide up to 7x more GPU resources on a single GPU. It allows multiple virtual GPUs (vGPUs) and, consequently, virtual machines (VMs) to run in parallel on a single GPU, while providing the isolation guarantees that vGPUs offer.
The capabilities provided by MIG can be leveraged to achieve NUMA node localization. By creating one MIG instance per NUMA node, you can ensure isolation between the GPU instances. This approach helps eliminate traffic between NUMA nodes.
MIG allows the splitting of the physical GPU into GPU instances (GIs), within which one or more compute instances (CIs) are defined. A CI contains all (in the case of a single CI per GI) or a portion of the SMs belonging to a GI. To enable localization within a GI, the idea is to create two GPU instances, one mapped onto each NUMA node. On a Blackwell GPU, you can enable MIG mode and list the available GPU instance profiles, as shown with the code in Figure 2.
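The following is a minimal sketch of the kind of commands Figure 2 refers to, assuming GPU index 0 and administrator privileges (a GPU reset may be required for MIG mode to take effect):
$ sudo nvidia-smi -i 0 -mig 1      # enable MIG mode on GPU 0
$ sudo nvidia-smi mig -i 0 -lgip   # list the available GPU instance profiles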
Because Blackwell has two NUMA nodes (one per chiplet), look for the profile with the most SMs that can be instantiated twice. As shown in Figure 2, this is the profile with ID 9, of which two instances can be created. Each instance has 89 GB of memory and 70 SMs. Using two such instances results in only 70×2=140 SMs in total, rather than the full 148 SMs on the device.
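With the profile identified, the two GPU instances can be created. This is a minimal sketch, assuming profile ID 9 from the listing above and GPU index 0:
$ sudo nvidia-smi mig -i 0 -cgi 9,9   # create two GPU instances, one per NUMA node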
At this point, it is necessary to create a CI in each GPU instance. This can be done using the commands shown in Figure 3 (a rough sketch follows the device listing below). The main GPU and the GPU instances now have their own identifier hash codes. Use these for the two NUMA nodes:
MIG 3g.90gb Device 0: (UUID: MIG-ee2ec0e5-0dda-5591-9ee7-4ae51028b6fa)
MIG 3g.90gb Device 1: (UUID: MIG-2bbb368b-7cb0-53da-b1a4-7ace0652a197)
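A rough sketch of the kind of commands Figure 3 shows, together with how the UUIDs above can be listed (exact flags can vary by driver version):
$ sudo nvidia-smi mig -cci   # create the default compute instance in each GPU instance (add -gi <ID> to target one instance)
$ nvidia-smi -L              # list devices; the MIG instances appear with their UUIDs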
To use these devices, add them to the CUDA_VISIBLE_DEVICES environment variable. For example, to run a two-process MPI job, you can create a wrapper script (wrapper.sh):
#!/bin/bash
# Bind each MPI rank to one MIG instance, then run the command passed in.
case $SLURM_PROCID in
0)
  export CUDA_VISIBLE_DEVICES="MIG-ee2ec0e5-0dda-5591-9ee7-4ae51028b6fa"
  ;;
1)
  export CUDA_VISIBLE_DEVICES="MIG-2bbb368b-7cb0-53da-b1a4-7ace0652a197"
  ;;
esac
"$@"
Then launch the MPI job:
$ mpirun -n 2 ./wrapper.sh my_executable
Finally, when all the work is done, MIG mode can be turned off.
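A corresponding teardown sketch, again assuming GPU index 0 and administrator privileges:
$ sudo nvidia-smi mig -dci      # destroy the compute instances
$ sudo nvidia-smi mig -dgi      # destroy the GPU instances
$ sudo nvidia-smi -i 0 -mig 0   # disable MIG mode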






What are the advantages of localization with MIG?
As an example application to demonstrate the benefits of localization with MIG, consider the Wilson-Dslash stencil operator, a key kernel for lattice quantum chromodynamics (LQCD) drawn from the QUDA library. This library is used to speed up several large LQCD codes, such as Chroma and MILC.
The Dslash kernel is a finite difference operation on a 4D toroidal lattice, where data at each lattice site is updated depending on the values of its eight orthogonal neighbors. The four dimensions in this case are the usual spatial dimensions (X, Y, Z) and the time dimension (T). The kernel is memory bandwidth-bound.
If the lattice is decomposed equally onto two NUMA nodes, say along the time axis, then each domain will need to access sites on the T-dimension boundaries of the other domain. As shown in Figure 5, with the lattice notionally laid out across the two NUMA nodes, green lattice sites on the boundaries of the subdomains need the red sites to complete their stencils. The possible data paths are regular memory access when unlocalized, or MPI message passing through the host in MIG-localized mode.


The most convenient way to access neighbors would be through the shared L2 cache and the interconnect. However, this path is unavailable when operating in MIG mode, and communication between the MIG instances must instead go through MPI over PCIe or NVLink. As a result, this path will be slower compared to accessing the main memory attached to the MIG instance.
Workloads that require little to no communication between the two MIG instances will therefore tend to benefit more from MIG mode. Instead of reading neighbors through the shared L2, one packs the sites on the boundaries and sends them through MPI. This step introduces additional latency (buffer packing, sending, and unpacking). While it saves GPU power by not using the shared L2 cache-to-cache interconnect, it does use power for the transfer through the host (over PCIe, for example).
The amount of data that must be transferred between the two processes is related to the number of face sites to be transmitted in the messages, specifically to the surface three-volume orthogonal to the direction of the split. In this example, the split is always in the T-direction, so each NUMA node notionally ends up with (Ns × Nt)/2 sites, where Ns is the number of sites in the spatial volume and Nt is the length of the time dimension. The surface-to-volume ratio is Ns/(Ns × Nt/2) = 2/Nt. For the problems considered here, Nt=64, so the surface-to-volume ratio stays constant at 1/32 ≈ 3.13%.
Figure 6 shows the unlocalized case. The global memory is made up of two memories connected to the NUMA nodes through memory controllers. The colored highlights on the lattices indicate that data may come from either the local DRAM or from the remote DRAM through the shared L2.


This is to be compared with the baseline case, where MIG is not employed. Neither the data nor the processing is localized in this case, and the scenario is better represented by Figure 8. Each NUMA node receives its data both from its local memory controller and from the other NUMA node. In reality, there is only one global lattice, and the separation into two parts for the two NUMA nodes in the figure is artificial.
In this scenario, thread blocks processing a given set of sites are assigned to the various NUMA nodes purely at the whim of the scheduler. Since the data is distributed evenly over the two NUMA memories, much more data is transferred across the shared L2 than in the MIG-localized case, where only the minimally required surface sites are transferred. This can incur a significant power cost.
On the other hand, the entire operation can be carried out with a single kernel, avoiding the latencies incurred by packing buffers for message passing and by accumulating the received faces at the end.
For the experimental results, look at the speedup in workload execution at various GPU power limits in watts. The speedups are the ratios of the wallclock times taken by the unlocalized and MIG approaches running at identical power limits (for example, both at 700 W).
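The power limit itself can be set with nvidia-smi. This is a minimal sketch, assuming GPU index 0 and that 400 W lies within the range supported by the board:
$ sudo nvidia-smi -i 0 -pl 400   # cap the GPU power limit at 400 W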
As shown in Figure 7, at a GPU power limit of 400 W, MIG outperforms the unlocalized case with speedups of up to 2.25x, depending on the volume of the workload. The reason is that the power consumed by the L2 fabric interface becomes a limiting factor when the GPU is running at a low power limit. With MIG mode, since no L2 fabric power is consumed transferring data between NUMA nodes, workloads can run much faster.
However, when the GPU power limit is increased, MIG mode performs slightly worse in the experiments represented by the gray, dark green, and black lines in Figure 9, and part of the green line. This is because, at higher power limits, the additional latency introduced by the message passing can outweigh the benefits of the localization.


As it turns out, the smaller cases (especially those indicated by the black and dark green lines in Figure 7) never exhaust the available power at higher power limits, even in the unlocalized case. As such, they benefit little from the GPU power savings gained through localization, and at these smaller volumes the latencies due to kernel launches are much more noticeable. The larger volumes (the green line, for example) require more power and hence can gain an advantage over the unlocalized setup even at higher power limits.
Get started with MIG-based NUMA node localization
Local L2 caching in NVIDIA data center GPUs can impact performance in NUMA-unaware workloads. Our experiments using the Wilson-Dslash operator in MIG mode show that when the GPU is running at lower power limits and data transfer over MPI (PCIe/NVLink) is low relative to local memory accesses, MIG-based NUMA node localization can yield speedups of up to 2.25x compared with the unlocalized case at the same power limit.
While systems running at a higher 1,000 W power envelope may achieve greater absolute performance than a 400 W configuration, MIG-based localization provides clear benefits under power-constrained conditions. In lower-power scenarios, it enables significantly faster performance, making it an especially effective optimization when operating within strict power limits.
However, in general, MIG does not offer the flexibility required to consistently achieve effective data localization, especially as interprocess communication overhead becomes more pronounced at higher power limits. MIG is really intended for use cases that are too small to fill a GPU on their own. As a result, it is not recommended for the cases presented in this post. To address these limitations, alternative approaches are under investigation.
To learn more, see Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS.
