If you’re an application developer or a cluster administrator, you’ve likely seen how non-uniform memory access (NUMA) can impact system performance. When an application isn’t fully NUMA-aware, performance can be inconsistent and unpredictable.
To address these challenges, NVIDIA released the Coherent Driver-based Memory Management (CDMM) mode for the NVIDIA driver on hardware-coherent platforms such as GH200, GB200, and GB300. CDMM allows the NVIDIA driver, instead of the OS, to manage GPU memory. This gives the application much finer-grained control to place data in the appropriate memory space and thereby extract maximum performance.
In this blog, we explain the differences between NUMA and CDMM and how they can impact application performance. We also published a whitepaper on this topic that you can check out for even more information.
What’s NUMA?
NUMA mode is the current default for the NVIDIA driver on hardware-coherent platforms. NUMA exposes all CPU (host) memory and GPU (device) memory to the OS. This means that standard Linux APIs such as malloc and mmap, as well as CUDA APIs, can allocate memory on both the CPU and the GPU. It also enables dynamic memory migration between CPU and GPU via user-space APIs, or automatically by the kernel to optimize resource utilization.
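For instance, on a hardware-coherent platform in NUMA mode, a pointer returned by plain malloc() can be handed directly to a CUDA kernel. The sketch below (not taken from the whitepaper) illustrates this; it assumes a coherent system such as GH200 where system-allocated memory is GPU-accessible.

```cpp
// Minimal sketch: a CUDA kernel operating directly on memory obtained from
// ordinary malloc(). Assumes a hardware-coherent platform (e.g., GH200)
// where system-allocated memory is GPU-accessible.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double* data, size_t n, double factor) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1 << 20;

    // Plain system allocation: no cudaMalloc or cudaMallocManaged involved.
    double* data = static_cast<double*>(malloc(n * sizeof(double)));
    for (size_t i = 0; i < n; ++i) data[i] = 1.0;

    // The GPU dereferences the system pointer over the coherent NVLink-C2C link;
    // in NUMA mode the kernel/driver may also migrate these pages to GPU memory.
    scale<<<static_cast<unsigned>((n + 255) / 256), 256>>>(data, n, 2.0);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // expected: 2.000000
    free(data);
    return 0;
}
```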
An important side effect to consider, though, is that NUMA mode causes GPU memory to be treated as a generic memory pool, which limits the ability to strictly isolate GPU memory from general OS system functions. With typical NUMA behavior, memory may spill onto the GPU, which may not be desirable for application performance.
That’s why NVIDIA provides an alternative: Coherent Driver-based Memory Management (CDMM) mode.
What are hardware-coherent platforms?
Several NVIDIA systems—including the GH200, GB200, and GB300—contain direct NVLink chip-to-chip (C2C) connections between the CPU and the GPU. This introduces a powerful capability not present on PCIe-connected systems: hardware-coherent memory. It allows both CPU and GPU memory to be directly addressed from either processor.
This can have some unintended consequences for applications that depend on specific NUMA behaviors. In particular, the operating system may choose GPU memory for unexpected or surprising uses, such as caching files or avoiding out-of-memory (OOM) conditions from an allocation request. For some applications and workflows, especially those that have been optimized for a specific layout of CPU and GPU memory (like Kubernetes), these differences may be undesirable.
The new CDMM mode addresses these challenges and can be particularly useful for applications like Kubernetes.
How NUMA impacts Kubernetes
Because Kubernetes is such a ubiquitous way to operate large GPU clusters, there are some specific and unexpected behaviors that may be encountered when running Kubernetes in NUMA mode. These behaviors can hurt performance and even application functionality.
- Memory over-reporting: Kubernetes incorrectly includes GPU memory in its system memory count, resulting in pods requesting more memory than available and causing OOM failures.
- Pod memory limits apply to GPU memory, not just system memory: Kubernetes pod memory limits, designed for system memory, incorrectly apply to both system and GPU memory when system-allocated memory is used, because each GPU is exposed as a NUMA node. This breaks the intended Pod spec API contract.
- Isolating GPU memory among pods: Kubernetes pods, by default, can access all memory across NUMA nodes, including GPU memory. This allows containers to allocate memory on GPUs they don’t have access to, breaking isolation.
For these reasons, we recommend using CDMM mode when using Kubernetes.
What’s CDMM?
CDMM is an alternative operating mode for NVIDIA drivers that prevents GPU memory from being exposed to the operating system as a software NUMA node. Instead, the NVIDIA device driver directly manages GPU memory, separating it from the CPU’s system memory. This approach is inspired by the PCIe-attached GPU model, where GPU memory is distinct from system memory.
In CDMM mode, CPU memory is managed by the Linux kernel and GPU memory is managed by the NVIDIA driver. This means the NVIDIA driver, not the OS, is responsible for GPU memory and has full control over how it is used, offering greater control and often better application performance.
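As a rough illustration (not taken from the whitepaper), data that must live in GPU memory under CDMM is placed there through the driver’s own allocators, for example cudaMalloc, rather than by the OS migrating system-allocated pages:

```cpp
// Minimal sketch: explicitly placing data in driver-managed GPU memory.
// Explicit allocation behaves the same in NUMA and CDMM modes, which is why it
// is the straightforward way to keep data resident on the GPU when CDMM
// disables migration of system-allocated pages.
#include <cuda_runtime.h>
#include <vector>

int main() {
    const size_t n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float* device_buf = nullptr;
    cudaMalloc(&device_buf, n * sizeof(float));           // GPU memory, managed by the NVIDIA driver
    cudaMemcpy(device_buf, host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);                    // explicit placement on the GPU

    // ... launch kernels that read and write device_buf ...

    cudaFree(device_buf);
    return 0;
}
```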
How CDMM affects CUDA developers
The primary impact of CDMM is on the migration of system-allocated memory. In the current implementation of CDMM, system-allocated memory will not be migrated to the GPU. The GPU can still access system-allocated memory across the C2C link, but memory pages will not be migrated.
For example, when an application uses hints to encourage migration through functions such as cudaMemPrefetchAsync(), cudaMemPrefetchBatchAsync(), cudaMemDiscardAndPrefetchBatchAsync(), and cudaMemAdvise(SetPreferredLocation), the pages will not migrate.
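A short sketch of what such hints look like in practice (an illustration, assuming system-allocated memory and device 0): under NUMA mode the preferred-location and prefetch hints may move the pages into GPU memory, while under CDMM they leave the pages in CPU memory and the kernel accesses them over the C2C link.

```cpp
// Minimal sketch: migration hints on system-allocated memory.
// NUMA mode: the pages may be migrated to GPU memory.
// CDMM mode: the hints do not migrate pages; the GPU reads them coherently over C2C.
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void touch(float* p, size_t n) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 24;
    const size_t bytes = n * sizeof(float);
    float* p = static_cast<float*>(malloc(bytes));
    for (size_t i = 0; i < n; ++i) p[i] = 0.0f;

    int device = 0;
    cudaSetDevice(device);

    // Hint the preferred location and request a prefetch to the GPU.
    cudaMemAdvise(p, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(p, bytes, device, 0);

    touch<<<static_cast<unsigned>((n + 255) / 256), 256>>>(p, n);
    cudaDeviceSynchronize();

    free(p);
    return 0;
}
```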
How CDMM affects system administration
When the system is in CDMM mode, there will still be NUMA nodes corresponding to the GPUs, but they will not present any memory to the OS. Tools such as numactl or mbind will have no effect when applied to GPU memory. We recommend that these tools NOT be used in CDMM mode for any GPU memory management. They can still be used to manage system memory.
CDMM is currently the default mode for Kubernetes-based GPU Operator deployments starting from Linux driver 580.65.06 and greater. To enable CDMM, you must pass a kernel module parameter and value when the driver is loaded. For the exact command and syntax to enable CDMM mode, please see the CDMM whitepaper.
Guidelines for CDMM and NUMA usage
The following highlights the major differences between CDMM and NUMA modes, and when to consider using one mode or the other.
Application-specific memory management
- NUMA mode: Best for applications that use OS NUMA APIs and rely on OS management of total system memory (CPU memory + GPU memory).
- CDMM mode: Ideal for applications that need direct GPU memory control, bypassing the OS.
Memory pooling
- NUMA mode: Allows GPU and CPU memory to form a larger single pool. Workloads benefit from aggregated memory and bandwidth management.
- CDMM mode: GPU memory is driver-managed, preventing the OS from using it as part of a larger pool. GPU memory is dedicated to GPU-specific data.
GPU memory usage: visibility and measurement
- NUMA mode: Standard tools report GPU memory use within the integrated pool, filterable by NUMA node, thereby providing an overall system memory view.
- CDMM mode: Offers fine-grained control and visibility into GPU memory. Driver-managed GPU memory gives administrators and developers a clear understanding of consumption for performance diagnosis and optimization.
Summary
The following table highlights the key differences in how memory is handled between NUMA and CDMM modes.
| | NUMA | CDMM |
| --- | --- | --- |
| Memory management | OS manages both CPU and GPU memory | OS manages CPU memory; NVIDIA driver manages GPU memory |
| GPU memory exposure | Exposed to the OS as a generic pool | Not exposed to the OS for use |
| Memory migration | Dynamic migration of system-allocated memory between CPU and GPU | System-allocated memory is NOT migrated to the GPU |
By understanding and strategically adopting CDMM, developers and administrators can unlock the full potential of NVIDIA hardware-coherent memory architectures, ensuring optimal performance and control for their GPU-accelerated workloads.
If you’re using a hardware-coherent platform such as GH200, GB200, or GB300, take a look at the whitepaper. And consider enabling CDMM mode to allow fine-grained application control of GPU memory, especially if you’re using Kubernetes.
