Starting with the 25.10 release, pip-installable cuML wheels can be downloaded directly from PyPI. No more complex installation steps or managing Conda environments: just a straightforward pip installation, like any other Python package.
The NVIDIA team has been working hard to make cuML more accessible and efficient across the board. One of the largest challenges has been managing the binary size of our CUDA C++ libraries, which affects user experience as well as the ability to pip install from PyPI. Distributing wheels on pypi.org reaches a broader audience and enables users in an enterprise setting to have the wheels available on internal pypi.org mirrors.
PyPI limits binary size to keep costs for the Python Software Foundation (PSF) under control and to protect users from downloading unexpectedly large binaries. The complexity of the cuML library has historically required a larger binary than PyPI could host, but we’ve worked closely with the PSF to overcome this by reducing binary size.
This post walks you through the new pip install path for cuML and the steps the team took to reduce the CUDA C++ library binary size, which made cuML wheels available on PyPI.
Installing cuML from PyPI
To install cuML from PyPI, use the following commands based on your system’s CUDA version. These packages have been optimized for compatibility and performance.
CUDA 13
Wheel size: ~250 MB
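A minimal install command for CUDA 13, assuming the wheel follows the usual RAPIDS -cuXX suffix naming:
pip install cuml-cu13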
CUDA 12
Wheel size: ~470 MB
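And for CUDA 12, under the same naming assumption:
pip install cuml-cu12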
How the cuML team reduced binary size by ~30%
By applying careful optimization techniques, the NVIDIA team reduced the CUDA 12 libcuml dynamic shared object (DSO) size from roughly 690 MB to 490 MB, a reduction of nearly 200 MB, or about 30%.
Smaller binaries provide:
- Faster downloads from PyPI
- Reduced storage requirements for users
- Quicker container builds for deployment
- Lower bandwidth costs for distribution
Reducing binary size required a systematic approach to identifying and eliminating bloat in the CUDA C++ codebase. Later in the post, we share the techniques used to accomplish this, which can benefit any team working with CUDA C++ libraries. We hope these methods will help library developers manage the size of their binaries and help move the ecosystem of CUDA C++ libraries toward more manageable binary sizes.
Why are CUDA binaries so large?
If you’ve ever shipped CUDA C++ code as a compiled binary, you’ve likely noticed that these libraries are significantly larger than equivalent C++ libraries offering similar features. CUDA C++ libraries contain numerous kernels (GPU functions) that form the majority of the binary size. Each kernel instantiation is essentially a cross product of:
- All template parameters used in the code
- Real GPU architectures that the library supports, compiled in the form of real-ISA machine code, which is the final binary format used for executing CUDA code
As you add more features and support newer architectures, binary sizes can quickly become intractable. For example, a kernel template instantiated for four data types and two layouts, compiled for five real architectures, yields 4 × 2 × 5 = 40 distinct kernel binaries. Even supporting only a single architecture results in binaries considerably larger than CPU-only libraries with the same feature set.
Note that the techniques shared here aren’t a panacea for all binary size issues, and they don’t cover every possible optimization method. We’re highlighting some of the better practices that worked for us in cuML and other RAPIDS libraries like RAFT and cuVS. Keep in mind that the examples are somewhat general, and developers often have to weigh tradeoffs between binary size and runtime performance.
Understanding CUDA Whole Compilation mode
Before diving into solutions, it’s important to understand how CUDA compilation works by default.
CUDA C++ libraries are typically compiled in Whole Compilation mode. This means that every translation unit (TU), that is, each .cu source file that directly launches a kernel with the triple chevron syntax (kernel<<<...>>>), includes a copy of the kernel. While the standard C++ link process removes duplicate symbols from the final binary, the CUDA C++ link process keeps every copy of a kernel compiled into a TU.
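As a minimal sketch of how this plays out (the file and function names here are hypothetical), consider a header that both defines a kernel template and launches it inline. Every .cu file that includes the header and calls the launcher compiles its own copy of the kernel, and all of those copies survive the CUDA link step:

// utils.cuh (hypothetical): kernel defined and launched in the header
#pragma once

template <typename T>
__global__ void fill_kernel(T* ptr, T value, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) ptr[i] = value;
}

template <typename T>
void fill(T* ptr, T value, int n) {
  fill_kernel<T><<<(n + 255) / 256, 256>>>(ptr, value, n);
}

// a.cu and b.cu each contain:
//   #include "utils.cuh"
//   void init_a(float* d, int n) { fill(d, 0.0f, n); }  // init_b in b.cu
// In Whole Compilation mode, fill_kernel<float> is compiled into both a.o
// and b.o, and both copies end up in the final shared library.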


To check whether there are duplicate kernel instantiations in your DSO, you can run the following command:
cuobjdump -symbols libcuml.so | grep STO_ENTRY | sort -b | uniq -c | sort -gb
Note: While enabling CUDA Separable Compilation can eliminate duplicate kernels, it’s not a complete solution. In fact, enabling it by default may actually increase binary size and link time in some cases. For more details, see Build CUDA Software at the Speed of Light.
Removing duplicate kernel instances programmatically
The key to solving this problem is to separate the kernel function definition from its declaration, ensuring each kernel is compiled in only one TU. Here’s how to structure this:
Function declaration (kernel.hpp):
namespace library {
void kernel_launcher();
}
Function and kernel compilation in a single TU only (kernel.cu):
#include "kernel.hpp"

namespace library {

__global__ void kernel() {
  /// code body
}

void kernel_launcher() {
  kernel<<<...>>>();
}

}  // namespace library
Requesting kernel execution (example.cu):
#include "kernel.hpp"

// Any other TU can request the kernel launch through the host-side wrapper.
void example() {
  library::kernel_launcher();
}
By separating the kernel function definition from its declaration, the kernel is compiled in a single TU, and a launcher construct is used to call it from other TUs. This host-side wrapper is necessary because defining a kernel body in one TU and including only its declaration in another TU to launch the kernel directly is not allowed.
If you’re shipping a header-only CUDA C++ library, or a compiled binary with shared utility kernels implemented as function templates, you face a challenge: function templates are instantiated at the call site.
Anti-pattern: Implicit template instantiation
Consider a kernel that supports both row-major and column-major 2D array layouts:
namespace library {
namespace {

template <typename T>
__global__ void kernel_row_major(T* ptr) {
  // code body
}

template <typename T>
__global__ void kernel_col_major(T* ptr) {
  // code body
}

}  // namespace

template <typename T>
void kernel_launcher(T* ptr, bool is_row_major) {
  if (is_row_major) {
    kernel_row_major<<<...>>>(ptr);
  } else {
    kernel_col_major<<<...>>>(ptr);
  }
}

}  // namespace library
This approach compiles instances of both kernels into every TU that calls kernel_launcher, regardless of whether the user needs both.
Pattern: Explicit template parameters
The solution is to expose compile-time information as template parameters:
namespace library {
namespace {

template <typename T>
__global__ void kernel_row_major(T* ptr) {
  // code body
}

template <typename T>
__global__ void kernel_col_major(T* ptr) {
  // code body
}

}  // namespace

template <typename T, bool is_row_major>
void kernel_launcher(T* ptr) {
  if constexpr (is_row_major) {
    kernel_row_major<<<...>>>(ptr);
  } else {
    kernel_col_major<<<...>>>(ptr);
  }
}

}  // namespace library
This approach introduces intentionality. If users require both kernel instances, they can instantiate them explicitly. However, most downstream libraries generally need only one, which significantly reduces binary size.
Note: This method also enables faster compilation and increased runtime performance as a result of compiling the smallest possible form of the kernel function template with a constrained set of template parameters. It also lets you bake in compile-time optimizations based on the instantiated templates.
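For example, a downstream TU that only needs the row-major path ends up instantiating a single kernel. The following is a minimal sketch; the file name, the kernel.cuh header name, and the float/row-major choice are illustrative:

// example.cu: only kernel_row_major<float> is instantiated here, because
// if constexpr discards the column-major branch at compile time.
#include "kernel.cuh"

void run_row_major(float* d_ptr) {
  library::kernel_launcher<float, true>(d_ptr);
}

A project that also needs the column-major variant opts in deliberately by adding a call (or an explicit instantiation) with is_row_major set to false.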
Optimizing kernel function templates in source files
Even after eliminating duplicate kernel instances, there’s more work to do for large kernels with multiple template types.
Anti-pattern: Template parameters for runtime arguments
When compiling binaries, unnecessarily introducing template parameters creates multiple kernel instances. This is the opposite of the approach used for function templates in header files, where more templates are desirable.
Example (detail/kernel.cuh):
namespace {

template <typename T, typename Lambda>
__global__ void kernel(T* ptr, Lambda lambda) {
  lambda(ptr);
}

}  // namespace
Usage (example.cu):
namespace library {

template <typename T>
void kernel_launcher(T* ptr) {
  if (some_conditional) {
    kernel<<<...>>>(ptr, lambda_type_1<T>{});
  } else {
    kernel<<<...>>>(ptr, lambda_type_2<T>{});
  }
}

}  // namespace library
This approach inevitably creates two instances of the kernel in the precompiled binary.
Pattern: Convert templates to runtime arguments
When writing kernel function templates, always ask: “Can this template argument be converted to a runtime argument?” Whenever the answer is yes, refactor as follows:
Definition (detail/kernel.cuh):
enum class LambdaSelector {
  lambda_type_1,
  lambda_type_2
};

template <typename T>
struct lambda_type_1 {
  __device__ void operator()(T* val) {
    // do some op
  }
};

template <typename T>
struct lambda_type_2 {
  __device__ void operator()(T* val) {
    // do another op
  }
};

namespace {

template <typename T>
__global__ void kernel(T* ptr, LambdaSelector lambda_selector) {
  if (lambda_selector == LambdaSelector::lambda_type_1) {
    lambda_type_1<T>{}(ptr);
  } else if (lambda_selector == LambdaSelector::lambda_type_2) {
    lambda_type_2<T>{}(ptr);
  }
}

}  // namespace
Usage (example.cu):
namespace library {

template <typename T>
void kernel_launcher(T* ptr) {
  if (some_conditional) {
    kernel<<<...>>>(ptr, LambdaSelector::lambda_type_1);
  } else {
    kernel<<<...>>>(ptr, LambdaSelector::lambda_type_2);
  }
}

}  // namespace library
Now only one kernel instance is shipped, directly reducing the binary size to almost half its original size. The impact of converting template arguments to runtime arguments scales with a factor of 1 / (cross product of template instantiations removed).
Note: This method enables faster compilation but may come at the cost of some runtime performance due to added kernel complexity and fewer compile-time optimizations.
Get started with cuML on PyPI
We’re excited to bring cuML to PyPI. We hope the techniques shared here will help other teams working with CUDA C++ achieve similar results and, when building Python interfaces, share their work on PyPI.
For more tips as you build libraries with CUDA C++, check out the updated CUDA Programming Guide. To get started with CUDA, see An Even Easier Introduction to CUDA.
