Python dominates machine learning thanks to its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and maintain bindings back to Python. For many Python developers and researchers, this is a significant barrier to entry.
Frameworks like PyTorch address this by implementing kernels in CUDA C++, either handwritten or by leveraging libraries such as the NVIDIA CUDA Core Compute Libraries (CCCL). Handwritten kernels are time-consuming and require deep, low-level architectural expertise. Using CUB, a C++ library within CCCL, is often the better choice, since its primitives are highly optimized per architecture and rigorously tested. But exposing CUB to Python has traditionally meant building and maintaining bindings and pre-instantiating C++ templates with fixed types and operators, limiting flexibility on the Python side.
The NVIDIA cuda.compute library overcomes these limitations by offering a high-level, Pythonic API for device-wide CUB primitives.
Using cuda.compute helped an NVIDIA CCCL team top the GPU MODE leaderboard, a kernel competition hosted by an online community of more than 20,000 members focused on learning and improving GPU programming. GPU MODE runs these kernel competitions to find the best implementations for a wide range of tasks, from simple vector addition to more complex block matrix multiplications.
The NVIDIA CCCL team focuses on delivering “speed-of-light” (SOL) implementations of parallel primitives across GPU architectures through high-level abstractions. It achieved the most first-place finishes overall across the tested GPU architectures: NVIDIA B200, NVIDIA H100, NVIDIA A100, and NVIDIA L4.
In this post, we share more details about how we were able to place so high on the leaderboard.
CUDA Python: GPU performance meets productivity
CUB offers highly optimized CUDA kernels for common parallel operations, including those featured in the GPU MODE competition. These kernels are tuned per architecture and widely considered near-speed-of-light implementations.
The cuda.compute library supports custom types and operators defined directly in Python. Under the hood, it just-in-time (JIT) compiles specialized kernels and applies link-time optimization to deliver near-SOL performance on par with CUDA C++. You stay in Python while getting the flexibility of templates and the performance of tuned CUDA kernels.
With cuda.compute you get:
- Fast, composable CUDA workflows in Python: Develop efficient and modular CUDA applications directly inside Python.
- Custom data types and operators: Use custom data types and operators without the need for C++ bindings (see the sketch after this list).
- Optimized performance: Achieve architecture-aware performance through proven CUB primitives.
- Rapid iteration: Speed up development with JIT compilation, which gives developers the flexibility and fast iteration cycles they need without compromising CUDA C++ levels of performance.
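As a minimal sketch of the custom-operator capability: the code below reuses the make_binary_transform entry point from the worked example later in this post, and assumes the operator argument can also be an ordinary Python function that cuda.compute JIT-compiles into the specialized kernel. Treat the names and exact call as illustrative rather than a definitive API reference.
import torch
import cuda.compute
# A custom elementwise operator written in plain Python.
# Assumption: make_binary_transform accepts a Python callable in place of an
# OpKind value and JIT-compiles it into the kernel, per the description above.
def scaled_add(a, b):
    return a + 2.0 * b
a = torch.arange(8, dtype=torch.float32, device="cuda")
b = torch.ones(8, dtype=torch.float32, device="cuda")
out = torch.empty_like(a)
# Build a transform specialized for these dtypes and the custom operator,
# then launch it with the element count, as in the vector-add example below
scaled_transform = cuda.compute.make_binary_transform(a, b, out, scaled_add)
scaled_transform(a, b, out, a.numel())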
The leaderboard results
Using cuda.compute, we submitted entries across GPU MODE benchmarks for PrefixSum, VectorAdd, Histogram, Sort, and Grayscale (look for username Nader).
For algorithms like sort, the CUB implementation was two to four times faster than the next best submission. This is the CCCL promise in action: SOL-class algorithms that outperform custom kernels for standard primitives you'd otherwise spend months building.
Where we didn't take first place, the gap typically came down to not having a tuning policy for that specific GPU. In some cases, our implementation was a more general solution, while higher-ranked submissions were specialized for specific problem sizes.
In other cases, the first-place submission was already using CUB or cuda.compute under the hood. This underscores that these libraries already represent the performance ceiling for many standard GPU algorithms, and that their performance characteristics are now well understood and intentionally relied upon by leading submissions.
This isn’t about winning
Leaderboard results are a byproduct; the real objective is learning with the community, benchmarking transparently, and demonstrating the power of Python for high-performance GPU work.
Our goal isn't to discourage hand-written CUDA kernels. There are many valid cases for custom kernels, such as novel algorithms, tight fusion, or specialized memory access patterns. But for standard primitives (sort, scan, reduce, histogram, and so on), your first move should be a proven, high-performance implementation. With cuda.compute, those tuned CUB primitives are now accessible directly from native Python, letting you build high-quality, production-grade, GPU-accelerated Python libraries.
This is great news for anyone building the next CuPy, RAPIDS component, or custom GPU-accelerated Python library: faster iteration, fewer glue layers, and production-grade performance, all while staying in pure Python.
How cuda.compute looks in practice
One of the first examples anyone writes when learning GPU programming is vector addition. With cuda.compute, we can solve it in pure Python by calling a device-wide primitive.
import torch
import cuda.compute
from cuda.compute import OpKind
# Build-time tensors (used to specialize the callable)
build_A = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_B = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_out = torch.empty(2, 2, dtype=torch.float16, device="cuda")
# JIT-compile the transform kernel
transform = cuda.compute.make_binary_transform(build_A, build_B, build_out, OpKind.PLUS)
# Defining custom_kernel is required to submit to the GPU MODE competition
def custom_kernel(data):
    # Invoke our transform operation on the input data
    A, B, out = data
    transform(A, B, out, A.numel())
    return out
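Outside the competition harness, you can exercise the wrapper with a couple of test tensors. The shapes and values below are illustrative, not part of the GPU MODE benchmark, and the check assumes that the build-time tensors above fix only the data type while the element count is supplied at launch:
A = torch.rand(1024, dtype=torch.float16, device="cuda")
B = torch.rand(1024, dtype=torch.float16, device="cuda")
out = torch.empty_like(A)
# Illustrative invocation: any float16 CUDA tensors of matching shape should
# work, since the element count is passed to the transform at launch time
result = custom_kernel((A, B, out))
# Quick sanity check against PyTorch's own elementwise addition
assert torch.allclose(result, A + B, rtol=1e-3, atol=1e-3)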
You'll find more cuda.compute examples on the GPU MODE leaderboard. The pattern is consistent: simple code with speed-of-light performance, achieved by calling device-wide building blocks that CCCL automatically optimizes for each GPU generation.
Other top-performing submissions in the VectorAdd category required dropping into C++ and inline PTX, resulting in code that is highly architecture-dependent.
Try cuda.compute today
If you're building Python GPU software, custom pipelines, library components, or performance-sensitive code, cuda.compute gives you the option to use CCCL's CUB primitives directly in Python and leverage building blocks designed for architecture-aware, speed-of-light performance.
To try cuda.compute today, you can install it via pip or conda:
pip install cuda-cccl[cu13] (or [cu12])
conda install -c conda-forge cccl-python cuda-version=12 (or 13)
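Once installed, a quick import check confirms the package is visible to your Python environment (this assumes a machine with a CUDA-capable GPU and driver):
python -c "import cuda.compute"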
We're building this with the community: your feedback and benchmarks shape our roadmap, so don't hesitate to reach out to us on GitHub or in the GPU MODE Discord.
