The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code.
This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1, which simplifies development by managing memory under the hood without sacrificing performance.
What’s CUB?
If you want to run a common algorithm (such as scan, histogram, or sort) on a GPU, CUB is likely the fastest way to do it. As a core component of the NVIDIA CUDA Core Compute Libraries (CCCL), CUB is designed to abstract away the complexity of manual CUDA thread management without sacrificing performance.
While libraries like Thrust provide a high-level, “host-side” interface much like the C++ Standard Template Library (STL) for quick prototyping, CUB provides a set of “device-side” primitives. This lets developers integrate highly optimized algorithms directly into their own custom kernels. To learn how to use CUB, check out the NVIDIA DLI course Fundamentals of Accelerated Computing with Modern CUDA C++.
The current CUB two-phase API
CUB is widely recommended for harnessing the full computational capabilities of NVIDIA GPUs. However, it carries some intricacies in its usage that may feel non-trivial. This section takes a step back to put these underlying mechanisms in perspective.
A simple, single-pass execution flow is usually assumed, where a single call to a function primitive suffices to execute the underlying algorithm and retrieve the results right after. The function’s side effects, such as modifying a variable or returning a result, are expected to be immediately visible to the next statement.
The CUB execution model diverges from this familiar single-pass pattern. Invoking a CUB primitive is a two-step process: first, calculate the necessary device memory size (the first call); second, explicitly allocate that memory and then execute the kernel (the second call).
The following is a typical CUB call:
// Temporary storage pointer and size, initially unset
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
// FIRST CALL: determine temporary storage size (passing nullptr only queries the size)
cub::DeviceScan::ExclusiveSum(nullptr, temp_storage_bytes, d_input, d_output, num_items);
// Allocate the required temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// SECOND CALL: run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_input, d_output, num_items);
The CUB interface introduces a practical challenge. The primitives must be invoked twice: first to determine the amount of temporary memory needed, and then a second time to execute the actual algorithm with the allocated storage.
A significant drawback of the traditional two-phase API is the lack of clarity regarding which arguments must remain consistent between the estimation and execution steps. Taking the snippet above for reference, it’s not programmatically clear which parameters influence the internal state and which can change between the calls, since the function signatures for both phases are identical. For instance, the d_input and d_output arguments are only actually used during the second call.
Despite its intricacies, the existing design serves the following fundamental purpose: by keeping allocation separated from execution, the user can allocate a piece of memory once and reuse it multiple times, or even share it between different algorithms.
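For instance, a minimal sketch of this reuse pattern might look like the following. The buffer names d_in1, d_out1, d_in2, and d_out2 are illustrative device arrays of num_items elements; since both scans process the same number of items, one scratch allocation serves both:
// Query the size once; it is the same for both scans below
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(nullptr, temp_storage_bytes, d_in1, d_out1, num_items);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Reuse the same scratch buffer for several invocations
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in1, d_out1, num_items);
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in2, d_out2, num_items);
cudaFree(d_temp_storage);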
While this design is valuable for a non-negligible subset of users, the overall user base leveraging this feature is relatively limited. That is why many users wrap their CUB calls to abstract away the two-step invocation required for every use. PyTorch is a case in point: it employs macros to wrap its CUB invocations into single calls and provide automatic memory management.
The following source code is from the pytorch/pytorch GitHub repo:
// handle the temporary storage and 'twice' calls for cub API
#define CUB_WRAPPER(func, ...) do {                                          \
  size_t temp_storage_bytes = 0;                                             \
  AT_CUDA_CHECK(func(nullptr, temp_storage_bytes, __VA_ARGS__));             \
  auto& caching_allocator = *::c10::cuda::CUDACachingAllocator::get();       \
  auto temp_storage = caching_allocator.allocate(temp_storage_bytes);        \
  AT_CUDA_CHECK(func(temp_storage.get(), temp_storage_bytes, __VA_ARGS__));  \
} while (false)
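With the wrapper in place, a call site collapses to one line. As a hypothetical example (assuming d_input, d_output, num_items, and a cudaStream_t named stream already exist), an invocation might look like this:
// Hypothetical call site: the macro queries the size, allocates, then runs the scan
CUB_WRAPPER(cub::DeviceScan::ExclusiveSum, d_input, d_output, num_items, stream);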
The use of macros presents its own drawbacks: they can obscure control flow and parameter passing, leading to opaque code that is hard to understand and significantly hinders debugging.
The new single-call CUB API
Given the wide usage of wrappers throughout many production codebases, there’s a recognized need to extend CUB by introducing the new single-call API:
// SINGLE CALL: allocation and execution in a single step
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items);
The example shows that no explicit memory allocation is required. Note, however, that the allocation process is still happening under the hood. Figure 1 shows that the single-call interface, which covers temporary storage estimation, memory allocation, and invoking the algorithm, introduces zero overhead compared to the two-phase API.


Figure 1 compares the GPU runtime of the original two-phase ExclusiveSum call against the newly introduced single-call version. The x-axis represents multiple input sizes, while the y-axis shows the normalized execution time for each kind of invocation. Two major conclusions can be drawn from this performance data:
- The new API introduces zero overhead
- Memory allocation still happens with the new API; it just happens under the hood
The second point can be verified by peeking inside the implementation of the new API. Asynchronous allocation is embedded within the device primitive:
// Simplified sketch of the single-call implementation
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, env = {}) {
  // ...
  d_temp_storage = mr.allocate(stream, bytes);   // asynchronous allocation on the stream
  // ... launch the algorithm using d_temp_storage ...
  mr.deallocate(stream, d_temp_storage, bytes);  // asynchronous deallocation on the stream
  // ...
}
The two-phase APIs haven’t been removed; those remain valid calls of existing CUB APIs. Rather, the single-call overloads are added on top of the existing APIs, and it’s expected that the majority of users will use them.
The environment and memory resources
Beyond resolving the issues mentioned above, the new single-call CUB API also expands the execution configuration capabilities of the invoked primitive. It introduces an environment argument, which can either customize memory allocation using memory resources or simply provide a stream to execute on (just like the two-phase API).
Memory resources are a new memory utility for allocating and freeing memory. The environment argument to single-call APIs can optionally contain a memory resource. When a memory resource is not provided through the environment argument, the API uses a default memory resource provided by CCCL. Conversely, you can pass one of the non-default memory resources that CCCL provides, or even pass your own custom memory resource.
// Use CCCL-provided memory resource type
cuda::device_memory_pool mr{cuda::devices[0]};
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, mr);
// Create and use your custom MR
my_memory_resource my_mr{cuda::experimental::devices[0]};
// Use it with CUB
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, my_mr);
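The my_memory_resource type used above is user-defined. Its exact requirements are set by CCCL’s memory resource concepts, so consult the CCCL documentation for the authoritative interface; as a rough, hypothetical sketch only (the member names and signatures here simply mirror the allocate/deallocate calls in the implementation sketch earlier and assume the relevant CCCL and CUDA runtime headers are included), a resource that forwards to the stream-ordered CUDA allocator might look like this:
// Hypothetical custom resource forwarding to cudaMallocAsync/cudaFreeAsync;
// not the authoritative CCCL resource concept
struct my_memory_resource {
  explicit my_memory_resource(cuda::device_ref) {}  // device argument kept only to match the usage above

  void* allocate(cuda::stream_ref stream, std::size_t bytes) {
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, bytes, stream.get());
    return ptr;
  }

  void deallocate(cuda::stream_ref stream, void* ptr, std::size_t /*bytes*/) {
    cudaFreeAsync(ptr, stream.get());
  }

  bool operator==(const my_memory_resource&) const { return true; }
};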
With the new API, CUDA stream handling is not removed but rather encapsulated within the new env argument. Of course, the stream can also be passed explicitly as before, even though the temporary allocation handling is gone. CUB now also provides cuda::stream_ref, which is type safe and whose usage should be preferred. You can also pass cuda::stream, which owns the underlying execution stream.
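As a minimal sketch of the stream-only case (assuming d_input, d_output, and num_items are set up as in the earlier examples), an existing cudaStream_t can be wrapped in a cuda::stream_ref and passed as the final argument:
cudaStream_t raw_stream;
cudaStreamCreate(&raw_stream);
// Type-safe, non-owning view of the stream
cuda::stream_ref stream{raw_stream};
// Temporary storage is still estimated, allocated, and freed under the hood
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, stream);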
Combining execution options
The single-call API enables more than just passing a memory resource or a stream as a final argument. Going forward, the environment argument will be the place for all execution-related knobs, including deterministic requirements, guarantees, user-defined tunings, and much more.
With the introduction of the single-call API, CUB has unlocked a vast suite of execution configuration features. With this plethora of new execution features, the question becomes: What’s the best way to combine them all?
The answer lies in the new env argument. By leveraging cuda::std::execution, CUB provides a central endpoint that acts as a versatile “control panel” for your algorithm. Instead of rigidly defined function arguments, the environment lets you create a combinatorial mixture of any features you need. Whether you want to pair a custom stream with a specific memory pool, or combine strict deterministic requirements with a custom tuning policy, the env argument handles it all in a single, type-safe object.
cuda::stream custom_stream{cuda::device_ref{0}};
auto memory_prop = cuda::std::execution::prop{cuda::mr::get_memory_resource,
cuda::device_default_memory_pool(cuda::device_ref{0})};
auto env = cuda::std::execution::env{custom_stream.get(), memory_prop};
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, env);
CUB currently provides the following algorithms that support the environment interface, with more to come:
- cub::DeviceReduce::Reduce
- cub::DeviceReduce::Sum
- cub::DeviceReduce::Min/Max/ArgMin/ArgMax
- cub::DeviceScan::ExclusiveSum
- cub::DeviceScan::ExclusiveScan
For up-to-date progress on the new environment-based overloads, see the CUB device primitives tracking issue on the NVIDIA/cccl GitHub repo.
Get started with CUB
By replacing the verbose two-phase pattern with a streamlined single-call interface, CUB offers a modern API that eliminates boilerplate without adding overhead. By leveraging the extensible env argument, you gain a unified control panel to seamlessly combine memory resources, streams, and other facilities. You’re encouraged to adopt this new approach to simplify your codebase and fully harness the computational power of your GPU. Download CUDA 13.1 or later and start using these single-call APIs.
