The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code.
This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1, which simplifies development by managing memory under the hood without sacrificing performance.
What’s CUB?
If you want to run a common algorithm (such as scan, histogram, or sort) on a GPU, CUB is likely the fastest way to do it. As a core component of the NVIDIA CUDA Core Compute Libraries (CCCL), CUB is designed to abstract away the complexity of manual CUDA thread management without sacrificing performance.
While libraries like Thrust provide a high-level, “host-side” interface much like the C++ Standard Template Library (STL) for quick prototyping, CUB provides a set of “device-side” primitives. This lets developers integrate highly optimized algorithms directly into their own custom kernels. To learn how to use CUB, check out the NVIDIA DLI course Fundamentals of Accelerated Computing with Modern CUDA C++.
The current CUB two-phase API
CUB is widely recommended for harnessing the full computational capabilities of NVIDIA GPUs. However, it carries some intricacies in its usage that may feel non-trivial. This section takes a step back to put these underlying mechanisms in perspective.
A simple, single-pass execution flow is usually assumed, where a single call to a function primitive suffices to execute the underlying algorithm and retrieve the results right after. The function’s side effects, such as modifying a variable or returning a result, are expected to be immediately visible to the next statement.
The CUB execution model diverges from this familiar single-pass pattern. Invoking a CUB primitive is a two-step process: first, calculate the necessary device memory size (the first call); second, explicitly allocate that memory and then execute the kernel (the second call).
The following is a typical CUB call:
// Temporary storage pointer and size, initially unset
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
// FIRST CALL: determine temporary storage size (passing nullptr only queries the size)
cub::DeviceScan::ExclusiveSum(nullptr, temp_storage_bytes, d_input, d_output, num_items);
// Allocate the required temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// SECOND CALL: run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_input, d_output, num_items);
The CUB interface introduces a practical challenge. The primitives must be invoked twice: first to determine the amount of temporary memory needed, and then a second time to execute the actual algorithm with the allocated storage.
A significant drawback of the traditional two-phase API is the lack of clarity regarding which arguments must remain consistent between the estimation and execution steps. Taking the snippet above for reference, it’s not programmatically clear which parameters influence the internal state and which can change between the calls, since the function signatures for both phases are identical. For instance, the d_input and d_output arguments are only actually used during the second call.
Despite its intricacies, the existing design serves the following fundamental purpose: by keeping allocation separated from execution, the user can allocate a piece of memory once and reuse it multiple times, or even share it between different algorithms.
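For instance, a minimal sketch of this reuse pattern might look like the following. The buffer names d_in1, d_out1, d_in2, and d_out2 are illustrative device arrays of num_items elements; since both scans process the same number of items, one scratch allocation serves both:
// Query the size once; it is the same for both scans below
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(nullptr, temp_storage_bytes, d_in1, d_out1, num_items);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Reuse the same scratch buffer for several invocations
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in1, d_out1, num_items);
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in2, d_out2, num_items);
cudaFree(d_temp_storage);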
While this design is valuable for a non-negligible subset of users, the overall user base leveraging this feature is relatively limited. That is why many users wrap their CUB calls to abstract away the two-step invocation required for every use. PyTorch is a case in point: it employs macros to wrap its CUB invocations into single calls and provide automatic memory management.
The following source code is from the pytorch/pytorch GitHub repo:
// handle the temporary storage and 'twice' calls for cub API
#define CUB_WRAPPER(func, ...) do {                                          \
  size_t temp_storage_bytes = 0;                                             \
  AT_CUDA_CHECK(func(nullptr, temp_storage_bytes, __VA_ARGS__));             \
  auto& caching_allocator = *::c10::cuda::CUDACachingAllocator::get();       \
  auto temp_storage = caching_allocator.allocate(temp_storage_bytes);        \
  AT_CUDA_CHECK(func(temp_storage.get(), temp_storage_bytes, __VA_ARGS__));  \
} while (false)
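With the wrapper in place, a call site collapses to one line. As a hypothetical example (assuming d_input, d_output, num_items, and a cudaStream_t named stream already exist), an invocation might look like this:
// Hypothetical call site: the macro queries the size, allocates, then runs the scan
CUB_WRAPPER(cub::DeviceScan::ExclusiveSum, d_input, d_output, num_items, stream);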
The use of macros presents its own drawbacks: they can obscure control flow and parameter passing, leading to opaque code that is hard to understand and significantly hinders debugging.
The new single-call CUB API
Given the wide usage of wrappers throughout many production codebases, there’s a recognized need to extend CUB by introducing the new single-call API:
// SINGLE CALL: allocation and execution in a single step
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items);
The example shows that no explicit memory allocation is required. Note, however, that the allocation process is still happening under the hood. Figure 1 shows that the single-call interface, which covers temporary storage estimation, memory allocation, and invoking the algorithm, introduces zero overhead compared to the two-phase API.


Figure 1 compares the GPU runtime of the original two-phase ExclusiveSum call against the newly introduced single-call version. The x-axis represents multiple input sizes, while the y-axis shows the normalized execution time for each kind of invocation. Two major conclusions can be drawn from this performance data:
- The new API introduces zero overhead
- Memory allocation still happens with the new API; it just happens under the hood
The second point can be verified by peeking inside the implementation of the new API. Asynchronous allocation is embedded within the device primitive:
// Simplified sketch of the single-call implementation
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, env = {}) {
  // ...
  d_temp_storage = mr.allocate(stream, bytes);   // asynchronous allocation on the stream
  // ... launch the algorithm using d_temp_storage ...
  mr.deallocate(stream, d_temp_storage, bytes);  // asynchronous deallocation on the stream
  // ...
}
The two-phase APIs haven’t been removed; those remain valid calls of existing CUB APIs. Rather, the single-call overloads are added on top of the existing APIs, and it’s expected that the majority of users will use them.
The environment and memory resources
Beyond resolving the issues mentioned above, the new single-call CUB API also expands the execution configuration capabilities of the invoked primitive. It introduces an environment argument, which can either customize memory allocation using memory resources or simply provide a stream to execute on (just like the two-phase API).
Memory resources are a new memory utility for allocating and freeing memory. The environment argument to single-call APIs can optionally contain a memory resource. When a memory resource is not provided through the environment argument, the API uses a default memory resource provided by CCCL. Conversely, you can pass one of the non-default memory resources that CCCL provides, or even pass your own custom memory resource.
// Use CCCL-provided memory resource type
cuda::device_memory_pool mr{cuda::devices[0]};
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, mr);
// Create and use your custom MR
my_memory_resource my_mr{cuda::experimental::devices[0]};
// Use it with CUB
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, my_mr);
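The my_memory_resource type used above is user-defined. Its exact requirements are set by CCCL’s memory resource concepts, so consult the CCCL documentation for the authoritative interface; as a rough, hypothetical sketch only (the member names and signatures here simply mirror the allocate/deallocate calls in the implementation sketch earlier and assume the relevant CCCL and CUDA runtime headers are included), a resource that forwards to the stream-ordered CUDA allocator might look like this:
// Hypothetical custom resource forwarding to cudaMallocAsync/cudaFreeAsync;
// not the authoritative CCCL resource concept
struct my_memory_resource {
  explicit my_memory_resource(cuda::device_ref) {}  // device argument kept only to match the usage above

  void* allocate(cuda::stream_ref stream, std::size_t bytes) {
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, bytes, stream.get());
    return ptr;
  }

  void deallocate(cuda::stream_ref stream, void* ptr, std::size_t /*bytes*/) {
    cudaFreeAsync(ptr, stream.get());
  }

  bool operator==(const my_memory_resource&) const { return true; }
};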
With the new API, CUDA stream handling is not removed but rather encapsulated within the new env argument. Of course, the stream can also be passed explicitly as before, even though the temporary allocation handling is gone. CUB now also provides cuda::stream_ref, which is type safe and whose usage should be preferred. You can also pass cuda::stream, which owns the underlying execution stream.
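As a minimal sketch of the stream-only case (assuming d_input, d_output, and num_items are set up as in the earlier examples), an existing cudaStream_t can be wrapped in a cuda::stream_ref and passed as the final argument:
cudaStream_t raw_stream;
cudaStreamCreate(&raw_stream);
// Type-safe, non-owning view of the stream
cuda::stream_ref stream{raw_stream};
// Temporary storage is still estimated, allocated, and freed under the hood
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, stream);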
Combining execution options
The single-call API enables more than just passing a memory resource or a stream as a final argument. Going forward, the environment argument will be the place for all execution-related knobs, including deterministic requirements, guarantees, user-defined tunings, and much more.
With the introduction of the single-call API, CUB has unlocked a vast suite of execution configuration features. With this plethora of new execution features, the question becomes: What’s the best way to combine them all?
The answer lies in the new env argument. By leveraging cuda::std::execution, CUB provides a central endpoint that acts as a versatile “control panel” for your algorithm. Instead of rigidly defined function arguments, the environment lets you create a combinatorial mixture of any features you need. Whether you want to pair a custom stream with a specific memory pool, or combine strict deterministic requirements with a custom tuning policy, the env argument handles it all in a single, type-safe object.
cuda::stream custom_stream{cuda::device_ref{0}};
auto memory_prop = cuda::std::execution::prop{cuda::mr::get_memory_resource,
cuda::device_default_memory_pool(cuda::device_ref{0})};
auto env = cuda::std::execution::env{custom_stream.get(), memory_prop};
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, env);
CUB currently provides the following algorithms that support the environment interface, with more to come:
- cub::DeviceReduce::Reduce
- cub::DeviceReduce::Sum
- cub::DeviceReduce::Min/Max/ArgMin/ArgMax
- cub::DeviceScan::ExclusiveSum
- cub::DeviceScan::ExclusiveScan
For up-to-date progress on the new environment-based overloads, see the CUB device primitives tracking issue on the NVIDIA/cccl GitHub repo.
Get started with CUB
By replacing the verbose two-phase pattern with a streamlined single-call interface, CUB offers a modern API that eliminates boilerplate without adding overhead. By leveraging the extensible env argument, you gain a unified control panel to seamlessly combine memory resources, streams, and other facilities. You’re encouraged to adopt this new approach to simplify your codebase and fully harness the computational power of your GPU. Download CUDA 13.1 or later and start using these single-call APIs.
