Starting with the 25.10 release, pip-installable cuML wheels can be downloaded directly from PyPI. No more complex installation steps or managing Conda environments: just a straightforward pip installation, like any other Python package.
The NVIDIA team has been working hard to make cuML more accessible and efficient across the board. One of the largest challenges has been managing the binary size of our CUDA C++ libraries, which affects user experience as well as the ability to pip install from PyPI. Distributing wheels on pypi.org reaches a broader audience and enables users in an enterprise setting to have the wheels available on internal pypi.org mirrors.
PyPI limits binary size to keep costs for the Python Software Foundation (PSF) under control and to protect users from downloading unexpectedly large binaries. The complexity of the cuML library has historically required a larger binary than PyPI could host, but we’ve worked closely with the PSF to overcome this by reducing binary size.
This post walks you through the new pip install path for cuML and the steps the team took to reduce the CUDA C++ library binary size, which made cuML wheels available on PyPI.
Installing cuML from PyPI
To install cuML from PyPI, use the following commands based on your system’s CUDA version. These packages have been optimized for compatibility and performance.
CUDA 13
Wheel size: ~250 MB
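A minimal install command for CUDA 13, assuming the wheel follows the usual RAPIDS -cuXX suffix naming:
pip install cuml-cu13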
CUDA 12
Wheel size: ~470 MB
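And for CUDA 12, under the same naming assumption:
pip install cuml-cu12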
How the cuML team reduced binary size by ~30%
By applying careful optimization techniques, the NVIDIA team reduced the CUDA 12 libcuml dynamic shared object (DSO) size from roughly 690 MB to 490 MB, a reduction of nearly 200 MB, or about 30%.
Smaller binaries provide:
- Faster downloads from PyPI
- Reduced storage requirements for users
- Quicker container builds for deployment
- Lower bandwidth costs for distribution
Reducing binary size required a systematic approach to identifying and eliminating bloat in the CUDA C++ codebase. Later in the post, we share the techniques used to accomplish this, which can benefit any team working with CUDA C++ libraries. We hope these methods will help library developers manage the size of their binaries and help move the ecosystem of CUDA C++ libraries toward more manageable binary sizes.
Why are CUDA binaries so large?
If you’ve ever shipped CUDA C++ code as a compiled binary, you’ve likely noticed that these libraries are significantly larger than equivalent C++ libraries offering similar features. CUDA C++ libraries contain numerous kernels (GPU functions) that form the majority of the binary size. Each kernel instantiation is essentially a cross product of:
- All template parameters used in the code
- Real GPU architectures that the library supports, compiled in the form of real-ISA machine code, which is the final binary format used for executing CUDA code
As you add more features and support newer architectures, binary sizes can quickly become intractable. For example, a kernel template instantiated for four data types and two layouts, compiled for five real architectures, yields 4 × 2 × 5 = 40 distinct kernel binaries. Even supporting only a single architecture results in binaries considerably larger than CPU-only libraries with the same feature set.
Note that the techniques shared here aren’t a panacea for all binary size issues, and they don’t cover every possible optimization method. We’re highlighting some of the better practices that worked for us in cuML and other RAPIDS libraries like RAFT and cuVS. Keep in mind that the examples are somewhat general, and developers often have to weigh tradeoffs between binary size and runtime performance.
Understanding CUDA Whole Compilation mode
Before diving into solutions, it’s important to understand how CUDA compilation works by default.
CUDA C++ libraries are typically compiled in Whole Compilation mode. This means that every translation unit (TU), that is, each .cu source file that directly launches a kernel with the triple chevron syntax (kernel<<<...>>>), includes a copy of the kernel. While the standard C++ link process removes duplicate symbols from the final binary, the CUDA C++ link process keeps every copy of a kernel compiled into a TU.
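As a minimal sketch of how this plays out (the file and function names here are hypothetical), consider a header that both defines a kernel template and launches it inline. Every .cu file that includes the header and calls the launcher compiles its own copy of the kernel, and all of those copies survive the CUDA link step:

// utils.cuh (hypothetical): kernel defined and launched in the header
#pragma once

template <typename T>
__global__ void fill_kernel(T* ptr, T value, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) ptr[i] = value;
}

template <typename T>
void fill(T* ptr, T value, int n) {
  fill_kernel<T><<<(n + 255) / 256, 256>>>(ptr, value, n);
}

// a.cu and b.cu each contain:
//   #include "utils.cuh"
//   void init_a(float* d, int n) { fill(d, 0.0f, n); }  // init_b in b.cu
// In Whole Compilation mode, fill_kernel<float> is compiled into both a.o
// and b.o, and both copies end up in the final shared library.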


To check whether there are duplicate kernel instantiations in your DSO, you can run the following command:
cuobjdump -symbols libcuml.so | grep STO_ENTRY | sort -b | uniq -c | sort -gb
Note: While enabling CUDA Separable Compilation can eliminate duplicate kernels, it’s not a complete solution. In fact, enabling it by default may actually increase binary size and link time in some cases. For more details, see Build CUDA Software at the Speed of Light.
Removing duplicate kernel instances programmatically
The key to solving this problem is to separate the kernel function definition from its declaration, ensuring each kernel is compiled in only one TU. Here’s how to structure this:
Function declaration (kernel.hpp):
namespace library {
void kernel_launcher();
}
Function and kernel compilation in a single TU only (kernel.cu):
#include "kernel.hpp"

namespace library {

__global__ void kernel() {
  /// code body
}

void kernel_launcher() {
  kernel<<<...>>>();
}

}  // namespace library
Requesting kernel execution (example.cu):
#include "kernel.hpp"

// Any other TU can request the kernel launch through the host-side wrapper.
void example() {
  library::kernel_launcher();
}
By separating the kernel function definition from its declaration, the kernel is compiled in a single TU, and a launcher construct is used to call it from other TUs. This host-side wrapper is necessary because defining a kernel body in one TU and including only its declaration in another TU to launch the kernel directly is not allowed.
If you’re shipping a header-only CUDA C++ library, or a compiled binary with shared utility kernels implemented as function templates, you face a challenge: function templates are instantiated at the call site.
Anti-pattern: Implicit template instantiation
Consider a kernel that supports both row-major and column-major 2D array layouts:
namespace library {
namespace {

template <typename T>
__global__ void kernel_row_major(T* ptr) {
  // code body
}

template <typename T>
__global__ void kernel_col_major(T* ptr) {
  // code body
}

}  // namespace

template <typename T>
void kernel_launcher(T* ptr, bool is_row_major) {
  if (is_row_major) {
    kernel_row_major<<<...>>>(ptr);
  } else {
    kernel_col_major<<<...>>>(ptr);
  }
}

}  // namespace library
This approach compiles instances of both kernels into every TU that calls kernel_launcher, regardless of whether the user needs both.
Pattern: Explicit template parameters
The solution is to expose compile-time information as template parameters:
namespace library {
namespace {

template <typename T>
__global__ void kernel_row_major(T* ptr) {
  // code body
}

template <typename T>
__global__ void kernel_col_major(T* ptr) {
  // code body
}

}  // namespace

template <typename T, bool is_row_major>
void kernel_launcher(T* ptr) {
  if constexpr (is_row_major) {
    kernel_row_major<<<...>>>(ptr);
  } else {
    kernel_col_major<<<...>>>(ptr);
  }
}

}  // namespace library
This approach introduces intentionality. If users require both kernel instances, they can instantiate them explicitly. However, most downstream libraries generally need only one, which significantly reduces binary size.
Note: This method also enables faster compilation and increased runtime performance as a result of compiling the smallest possible form of the kernel function template with a constrained set of template parameters. It also lets you bake in compile-time optimizations based on the instantiated templates.
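For example, a downstream TU that only needs the row-major path ends up instantiating a single kernel. The following is a minimal sketch; the file name, the kernel.cuh header name, and the float/row-major choice are illustrative:

// example.cu: only kernel_row_major<float> is instantiated here, because
// if constexpr discards the column-major branch at compile time.
#include "kernel.cuh"

void run_row_major(float* d_ptr) {
  library::kernel_launcher<float, true>(d_ptr);
}

A project that also needs the column-major variant opts in deliberately by adding a call (or an explicit instantiation) with is_row_major set to false.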
Optimizing kernel function templates in source files
Even after eliminating duplicate kernel instances, there’s more work to do for large kernels with multiple template types.
Anti-pattern: Template parameters for runtime arguments
When compiling binaries, unnecessarily introducing template parameters creates multiple kernel instances. This is the opposite of the approach used for function templates in header files, where more templates are desirable.
Example (detail/kernel.cuh):
namespace {

template <typename T, typename Lambda>
__global__ void kernel(T* ptr, Lambda lambda) {
  lambda(ptr);
}

}  // namespace
Usage (example.cu):
namespace library {

template <typename T>
void kernel_launcher(T* ptr) {
  if (some_conditional) {
    kernel<<<...>>>(ptr, lambda_type_1<T>{});
  } else {
    kernel<<<...>>>(ptr, lambda_type_2<T>{});
  }
}

}  // namespace library
This approach inevitably creates two instances of the kernel in the precompiled binary.
Pattern: Convert templates to runtime arguments
When writing kernel function templates, always ask: “Can this template argument be converted to a runtime argument?” Whenever the answer is yes, refactor as follows:
Definition (detail/kernel.cuh):
enum class LambdaSelector {
  lambda_type_1,
  lambda_type_2
};

template <typename T>
struct lambda_type_1 {
  __device__ void operator()(T* val) {
    // do some op
  }
};

template <typename T>
struct lambda_type_2 {
  __device__ void operator()(T* val) {
    // do another op
  }
};

namespace {

template <typename T>
__global__ void kernel(T* ptr, LambdaSelector lambda_selector) {
  if (lambda_selector == LambdaSelector::lambda_type_1) {
    lambda_type_1<T>{}(ptr);
  } else if (lambda_selector == LambdaSelector::lambda_type_2) {
    lambda_type_2<T>{}(ptr);
  }
}

}  // namespace
Usage (example.cu):
namespace library {

template <typename T>
void kernel_launcher(T* ptr) {
  if (some_conditional) {
    kernel<<<...>>>(ptr, LambdaSelector::lambda_type_1);
  } else {
    kernel<<<...>>>(ptr, LambdaSelector::lambda_type_2);
  }
}

}  // namespace library
Now only one kernel instance is shipped, directly reducing the binary size to almost half its original size. The impact of converting template arguments to runtime arguments scales with a factor of 1 / (cross product of template instantiations removed).
Note: This method enables faster compilation but may come at the cost of some runtime performance due to added kernel complexity and fewer compile-time optimizations.
Get started with cuML on PyPI
We’re excited to bring cuML to PyPI. We hope the techniques shared here will help other teams working with CUDA C++ achieve similar results and, when building Python interfaces, share their work on PyPI.
For more tips as you build libraries with CUDA C++, check out the updated CUDA Programming Guide. To get started with CUDA, see An Even Easier Introduction to CUDA.
