Controlling Floating-Point Determinism in NVIDIA CCCL

A computation is considered deterministic if multiple runs with the same input data produce the same bitwise result. While this may seem like a straightforward property to guarantee, it can be difficult to achieve in practice, especially when parallel programming meets floating-point arithmetic. This is because floating-point addition and multiplication are not strictly associative, that is, (a + b) + c may not equal a + (b + c), due to the rounding that occurs when intermediate results are stored with finite precision.

With NVIDIA CUDA Core Compute Libraries (CCCL) 3.1, CUB, a low-level CUDA library for speed-of-light parallel device algorithms, added a new single-phase API that accepts an execution environment, enabling users to customize algorithm behavior. We can use this environment to configure the reduce algorithm's determinism property. This can only be done through the new single-phase API, because the two-phase API does not accept an execution environment.

The following code shows how to specify the determinism level in CUB (the complete example is available online in Compiler Explorer).

auto input  = thrust::device_vector{0.0f, 1.0f, 2.0f, 3.0f};
auto output = thrust::device_vector<float>(1);

// The level can be not_guaranteed, run_to_run (the default), or gpu_to_gpu.
auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);

auto error = cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceReduce::Sum failed with status: " << error << std::endl;
}

assert(output[0] == 6.0f);

We start by specifying the input and output vectors. We then use cuda::execution::require() to construct a cuda::std::execution::env object, setting the determinism level to not_guaranteed.

There are three determinism levels available for reduction:

  • not_guaranteed
  • run_to_run
  • gpu_to_gpu

Determinism not guaranteed

In floating-point reductions, the result can depend on the order in which elements are combined. If two runs apply the reduction operator in different orders, the final values may differ slightly. In many applications, these minor differences are acceptable. By relaxing the requirement for strict determinism, the reduction implementation can rearrange the operations in any order, which can improve runtime performance.

In CUB, not_guaranteed relaxes the determinism level. This allows atomic operations (whose unordered execution across threads leads to a different order of operations between runs) to compute both the block-level partial aggregates and the final reduction value. The entire reduction can also be performed in a single kernel launch, because the atomic operations combine the block-level partial aggregates into the final result.

The nondeterministic reduce variant is often faster than the run-to-run deterministic version, particularly for smaller input arrays, where performing the reduction in a single kernel reduces latency from multiple kernel launches, minimizes extra data movement, and avoids additional synchronization. The tradeoff is that repeated runs may yield slightly different results due to the lack of deterministic behavior.

Run-to-run determinism

While nondeterministic reductions offer potential performance gains, CUB also provides a mode that guarantees consistent results across runs. By default, cub::DeviceReduce is run-to-run deterministic, which corresponds to setting the determinism level to run_to_run in the single-phase API. In this mode, multiple invocations with the same input, kernel launch configuration, and GPU will produce identical outputs.

This determinism is achieved by structuring the reduction as a fixed, hierarchical tree rather than relying on atomics, whose update order can vary across runs. At each stage of the reduction, elements are first combined within individual threads. The intermediate results are then reduced across threads within a warp using shuffle instructions, followed by a block-wide reduction using shared memory. Finally, a second kernel aggregates the per-block results to produce the final output. Because this sequence is predetermined and independent of the relative timing of thread execution, the same inputs, kernel configuration, and GPU yield the same bitwise result.

GPU-to-GPU determinism

For applications that require the highest level of reproducibility, CUB also provides GPU-to-GPU determinism, which guarantees identical results across multiple runs with the same input on different GPUs. This mode corresponds to setting the determinism level to gpu_to_gpu.

To achieve this level of determinism, CUB uses a Reproducible Floating-point Accumulator (RFA), a solution based on the NVIDIA GTC 2024 session, Restoring the Scientific Method to HPC: High Performance Reproducible Parallel Reductions. The RFA counters floating-point non-associativity, which arises when adding numbers with different exponents, by grouping all input values into a fixed number of exponent ranges (the default is three bins). This fixed, structured accumulation scheme ensures the result is independent of GPU architecture.

The accuracy of the result depends on the number of bins: more bins provide greater accuracy, but also increase the number of intermediate summations, which can reduce performance. The current implementation defaults to three bins, a setting that balances performance and accuracy. It is worth noting that this configuration is not just strictly deterministic, but also guarantees numerically accurate results, providing tighter error bounds than the pairwise summation traditionally used in parallel reductions.

How results vary based on the determinism levels

The three determinism levels differ in the amount of variation they produce across multiple runs:

  • Not-guaranteed determinism may produce slightly different summation values on each invocation.
  • Run-to-run determinism ensures the same value for every invocation on a single GPU, but the result may vary if a different GPU is used.
  • GPU-to-GPU determinism guarantees that the summation value is identical for every invocation, regardless of which GPU executes the reduction.

This is shown in Figure 1, with the summation of an array for each determinism level (represented by green, blue, and red circles) plotted against the run number. A flat horizontal line indicates that the reduction produces the same result on every run.

[Figure: charts showing that the GPU-to-GPU and run-to-run algorithms produce identical results across runs, while the not-guaranteed results vary slightly.]
Figure 1. Summation value compared to run number

Determinism performance comparison

The level of determinism chosen affects the performance of cub::DeviceReduce. Not-guaranteed determinism, with its relaxed requirements, provides the best performance. The default run-to-run determinism delivers good performance but is slightly slower than not-guaranteed determinism. GPU-to-GPU determinism, which enforces the strictest reproducibility across different GPUs, can significantly reduce performance, increasing execution time by 20% to 30% for large problem sizes.

Figure 2 compares the performance of the different determinism requirements for float32 and float64 inputs on an NVIDIA H200 GPU (lower is better). The results clearly show how the choice of determinism level impacts execution time across data types.

[Figure: bar graph of elapsed time versus number of elements. Not-guaranteed is always the fastest, followed closely by run-to-run; GPU-to-GPU is significantly slower than the other two.]
Figure 2. Elapsed time compared to the number of elements

Conclusion

With the introduction of the single-phase API and explicit determinism levels, CUB provides an enhanced toolbox for controlling both the behavior and performance of reduction algorithms. Users can select the level of determinism that best fits their needs: from the high-performance, flexible not-guaranteed mode, to the reliable run-to-run default, up to the strictest GPU-to-GPU reproducibility.

Determinism in CUB isn't limited to reductions. We plan to extend these capabilities to additional algorithms so developers can control reproducibility across a wider range of parallel CUDA primitives. For updates and discussion, see the ongoing GitHub issue on expanded determinism support, follow our roadmap, and provide feedback on which algorithms you'd like to see deterministic versions of.


