Python dominates machine learning thanks to its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and maintain bindings back to Python. For many Python developers and researchers, this is a significant barrier to entry.
Frameworks like PyTorch address this by implementing kernels in CUDA C++, either handwritten or by leveraging libraries such as the NVIDIA CUDA Core Compute Libraries (CCCL). Handwritten kernels are time-consuming and require deep, low-level architectural expertise. Using CUB, a C++ library within CCCL, is often the better choice, since its primitives are highly optimized per architecture and rigorously tested. But exposing CUB to Python has traditionally meant building and maintaining bindings and pre-instantiating C++ templates with fixed types and operators, limiting flexibility on the Python side.
The NVIDIA cuda.compute library overcomes these limitations by offering a high-level, Pythonic API for device-wide CUB primitives.
Using cuda.compute helped an NVIDIA CCCL team top the GPU MODE leaderboard, a kernel competition hosted by an online community of more than 20,000 members focused on learning and improving GPU programming. GPU MODE runs these kernel competitions to find the best implementations for a wide range of tasks, from simple vector addition to more complex block matrix multiplications.
The NVIDIA CCCL team focuses on delivering “speed-of-light” (SOL) implementations of parallel primitives across GPU architectures through high-level abstractions. It achieved the most first-place finishes overall across the tested GPU architectures: NVIDIA B200, NVIDIA H100, NVIDIA A100, and NVIDIA L4.
In this post, we share more details about how we were able to place so high on the leaderboard.
CUDA Python: GPU performance meets productivity
CUB offers highly optimized CUDA kernels for common parallel operations, including those featured in the GPU MODE competition. These kernels are tuned per architecture and widely considered near-speed-of-light implementations.
The cuda.compute library supports custom types and operators defined directly in Python. Under the hood, it just-in-time (JIT) compiles specialized kernels and applies link-time optimization to deliver near-SOL performance on par with CUDA C++. You stay in Python while getting the flexibility of templates and the performance of tuned CUDA kernels.
With cuda.compute you get:
- Fast, composable CUDA workflows in Python: Develop efficient and modular CUDA applications directly inside Python.
- Custom data types and operators: Use custom data types and operators without the need for C++ bindings (see the sketch after this list).
- Optimized performance: Achieve architecture-aware performance through proven CUB primitives.
- Rapid iteration: Speed up development with JIT compilation, which gives developers the flexibility and fast iteration cycles they need without compromising CUDA C++ levels of performance.
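As a minimal sketch of the custom-operator capability: the code below reuses the make_binary_transform entry point from the worked example later in this post, and assumes the operator argument can also be an ordinary Python function that cuda.compute JIT-compiles into the specialized kernel. Treat the names and exact call as illustrative rather than a definitive API reference.
import torch
import cuda.compute
# A custom elementwise operator written in plain Python.
# Assumption: make_binary_transform accepts a Python callable in place of an
# OpKind value and JIT-compiles it into the kernel, per the description above.
def scaled_add(a, b):
    return a + 2.0 * b
a = torch.arange(8, dtype=torch.float32, device="cuda")
b = torch.ones(8, dtype=torch.float32, device="cuda")
out = torch.empty_like(a)
# Build a transform specialized for these dtypes and the custom operator,
# then launch it with the element count, as in the vector-add example below
scaled_transform = cuda.compute.make_binary_transform(a, b, out, scaled_add)
scaled_transform(a, b, out, a.numel())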
The leaderboard results
Using cuda.compute, we submitted entries across GPU MODE benchmarks for PrefixSum, VectorAdd, Histogram, Sort, and Grayscale (look for username Nader).
For algorithms like sort, the CUB implementation was two to four times faster than the next best submission. This is the CCCL promise in action: SOL-class algorithms that outperform custom kernels for standard primitives you'd otherwise spend months building.
Where we didn't take first place, the gap typically came down to not having a tuning policy for that specific GPU. In some cases, our implementation was a more general solution, while higher-ranked submissions were specialized for specific problem sizes.
In other cases, the first-place submission was already using CUB or cuda.compute under the hood. This underscores that these libraries already represent the performance ceiling for many standard GPU algorithms, and that their performance characteristics are now well understood and intentionally relied upon by leading submissions.
This isn’t about winning
Leaderboard results are a byproduct; the real objective is learning with the community, benchmarking transparently, and demonstrating the power of Python for high-performance GPU work.
Our goal isn't to discourage hand-written CUDA kernels. There are many valid cases for custom kernels, such as novel algorithms, tight fusion, or specialized memory access patterns. But for standard primitives (sort, scan, reduce, histogram, and so on), your first move should be a proven, high-performance implementation. With cuda.compute, those tuned CUB primitives are now accessible directly from native Python, letting you build high-quality, production-grade, GPU-accelerated Python libraries.
This is great news for anyone building the next CuPy, RAPIDS component, or custom GPU-accelerated Python library: faster iteration, fewer glue layers, and production-grade performance, all while staying in pure Python.
How cuda.compute looks in practice
One of the first examples anyone writes when learning GPU programming is vector addition. With cuda.compute, we can solve it in pure Python by calling a device-wide primitive.
import torch
import cuda.compute
from cuda.compute import OpKind
# Build-time tensors (used to specialize the callable)
build_A = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_B = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_out = torch.empty(2, 2, dtype=torch.float16, device="cuda")
# JIT-compile the transform kernel
transform = cuda.compute.make_binary_transform(build_A, build_B, build_out, OpKind.PLUS)
# Defining custom_kernel is required to submit to the GPU MODE competition
def custom_kernel(data):
    # Invoke our transform operation on the input data
    A, B, out = data
    transform(A, B, out, A.numel())
    return out
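Outside the competition harness, you can exercise the wrapper with a couple of test tensors. The shapes and values below are illustrative, not part of the GPU MODE benchmark, and the check assumes that the build-time tensors above fix only the data type while the element count is supplied at launch:
A = torch.rand(1024, dtype=torch.float16, device="cuda")
B = torch.rand(1024, dtype=torch.float16, device="cuda")
out = torch.empty_like(A)
# Illustrative invocation: any float16 CUDA tensors of matching shape should
# work, since the element count is passed to the transform at launch time
result = custom_kernel((A, B, out))
# Quick sanity check against PyTorch's own elementwise addition
assert torch.allclose(result, A + B, rtol=1e-3, atol=1e-3)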
You'll find more cuda.compute examples on the GPU MODE leaderboard. The pattern is consistent: simple code with speed-of-light performance, achieved by calling device-wide building blocks that CCCL automatically optimizes for each GPU generation.
Other top-performing submissions in the VectorAdd category required dropping into C++ and inline PTX, resulting in code that is highly architecture-dependent.
Try cuda.compute today
If you're building Python GPU software, custom pipelines, library components, or performance-sensitive code, cuda.compute gives you the option to use CCCL's CUB primitives directly in Python and leverage building blocks designed for architecture-aware, speed-of-light performance.
To try cuda.compute today, you can install it via pip or conda:
pip install cuda-cccl[cu13] (or [cu12])
conda install -c conda-forge cccl-python cuda-version=12 (or 13)
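Once installed, a quick import check confirms the package is visible to your Python environment (this assumes a machine with a CUDA-capable GPU and driver):
python -c "import cuda.compute"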
We're building this with the community: your feedback and benchmarks shape our roadmap, so don't hesitate to reach out to us on GitHub or in the GPU MODE Discord.
