Controlling Floating-Point Determinism in NVIDIA CCCL

A computation is considered deterministic if multiple runs with the same input data produce the same bitwise result. While this may seem like a straightforward property to guarantee, it can be difficult to achieve in practice, especially when parallel programming meets floating-point arithmetic. This is because floating-point addition and multiplication are not strictly associative, that is, (a + b) + c may not equal a + (b + c), due to the rounding that occurs when intermediate results are stored with finite precision.

With NVIDIA CUDA Core Compute Libraries (CCCL) 3.1, CUB, a low-level CUDA library for speed-of-light parallel device algorithms, added a new single-phase API that accepts an execution environment, enabling users to customize algorithm behavior. We can use this environment to configure the reduce algorithm's determinism property. This can only be done through the new single-phase API, because the two-phase API does not accept an execution environment.

The following code shows how to specify the determinism level in CUB (the complete example is available online in Compiler Explorer).

auto input  = thrust::device_vector{0.0f, 1.0f, 2.0f, 3.0f};
auto output = thrust::device_vector<float>(1);

// The level can be not_guaranteed, run_to_run (the default), or gpu_to_gpu.
auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);

auto error = cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceReduce::Sum failed with status: " << error << std::endl;
}

assert(output[0] == 6.0f);

We start by specifying the input and output vectors. We then use cuda::execution::require() to construct a cuda::std::execution::env object, setting the determinism level to not_guaranteed.

There are three determinism levels available for reduction:

  • not_guaranteed
  • run_to_run
  • gpu_to_gpu

Determinism not guaranteed

In floating-point reductions, the result can depend on the order in which elements are combined. If two runs apply the reduction operator in different orders, the final values may differ slightly. In many applications, these minor differences are acceptable. By relaxing the requirement for strict determinism, the reduction implementation can rearrange the operations in any order, which can improve runtime performance.

In CUB, not_guaranteed relaxes the determinism level. This allows atomic operations (whose unordered execution across threads leads to a different order of operations between runs) to compute both the block-level partial aggregates and the final reduction value. The entire reduction can also be performed in a single kernel launch, because the atomic operations combine the block-level partial aggregates into the final result.

The nondeterministic reduce variant is often faster than the run-to-run deterministic version, particularly for smaller input arrays, where performing the reduction in a single kernel reduces latency from multiple kernel launches, minimizes extra data movement, and avoids additional synchronization. The tradeoff is that repeated runs may yield slightly different results due to the lack of deterministic behavior.

Run-to-run determinism

While nondeterministic reductions offer potential performance gains, CUB also provides a mode that guarantees consistent results across runs. By default, cub::DeviceReduce is run-to-run deterministic, which corresponds to setting the determinism level to run_to_run in the single-phase API. In this mode, multiple invocations with the same input, kernel launch configuration, and GPU will produce identical outputs.

This determinism is achieved by structuring the reduction as a fixed, hierarchical tree rather than relying on atomics, whose update order can vary across runs. At each stage of the reduction, elements are first combined within individual threads. The intermediate results are then reduced across threads within a warp using shuffle instructions, followed by a block-wide reduction using shared memory. Finally, a second kernel aggregates the per-block results to produce the final output. Because this sequence is predetermined and independent of the relative timing of thread execution, the same inputs, kernel configuration, and GPU yield the same bitwise result.

GPU-to-GPU determinism

For applications that require the highest level of reproducibility, CUB also provides GPU-to-GPU determinism, which guarantees identical results across multiple runs with the same input on different GPUs. This mode corresponds to setting the determinism level to gpu_to_gpu.

To achieve this level of determinism, CUB uses a Reproducible Floating-point Accumulator (RFA), a solution based on the NVIDIA GTC 2024 session, Restoring the Scientific Method to HPC: High Performance Reproducible Parallel Reductions. The RFA counters floating-point non-associativity, which arises when adding numbers with different exponents, by grouping all input values into a fixed number of exponent ranges (the default is three bins). This fixed, structured accumulation scheme ensures the result is independent of GPU architecture.

The accuracy of the result depends on the number of bins: more bins provide greater accuracy, but also increase the number of intermediate summations, which can reduce performance. The current implementation defaults to three bins, a setting that balances performance and accuracy. It is worth noting that this configuration is not just strictly deterministic, but also guarantees numerically accurate results, providing tighter error bounds than the pairwise summation traditionally used in parallel reductions.

How results vary based on the determinism levels

The three determinism levels differ in the amount of variation they produce across multiple runs:

  • Not-guaranteed determinism may produce slightly different summation values on each invocation.
  • Run-to-run determinism ensures the same value for every invocation on a single GPU, but the result may vary if a different GPU is used.
  • GPU-to-GPU determinism guarantees that the summation value is identical for every invocation, regardless of which GPU executes the reduction.

This is shown in Figure 1, with the summation of an array for each determinism level (represented by green, blue, and red circles) plotted against the run number. A flat horizontal line indicates that the reduction produces the same result on every run.

[Figure: charts showing that the GPU-to-GPU and run-to-run algorithms produce identical results across runs, while the not-guaranteed results vary slightly.]
Figure 1. Summation value compared to run number

Determinism performance comparison

The level of determinism chosen affects the performance of cub::DeviceReduce. Not-guaranteed determinism, with its relaxed requirements, provides the best performance. The default run-to-run determinism delivers good performance but is slightly slower than not-guaranteed determinism. GPU-to-GPU determinism, which enforces the strictest reproducibility across different GPUs, can significantly reduce performance, increasing execution time by 20% to 30% for large problem sizes.

Figure 2 compares the performance of the different determinism requirements for float32 and float64 inputs on an NVIDIA H200 GPU (lower is better). The results clearly show how the choice of determinism level impacts execution time across data types.

[Figure: bar graph of elapsed time versus number of elements. Not-guaranteed is always the fastest, followed closely by run-to-run; GPU-to-GPU is significantly slower than the other two.]
Figure 2. Elapsed time compared to the number of elements

Conclusion

With the introduction of the single-phase API and explicit determinism levels, CUB provides an enhanced toolbox for controlling both the behavior and performance of reduction algorithms. Users can select the level of determinism that best fits their needs: from the high-performance, flexible not-guaranteed mode, to the reliable run-to-run default, up to the strictest GPU-to-GPU reproducibility.

Determinism in CUB isn't limited to reductions. We plan to extend these capabilities to additional algorithms so developers can control reproducibility across a wider range of parallel CUDA primitives. For updates and discussion, see the ongoing GitHub issue on expanded determinism support, follow our roadmap, and provide feedback on which algorithms you'd like to see deterministic versions of.


