Better Bug Detection: How Compile-Time Instrumentation for Compute Sanitizer Enhances Memory Safety



CUDA C++ is standard C++ with extensions that enable functions to run on many parallel threads on a GPU. It has facilitated widespread adoption while allowing developers to achieve peak performance. However, CUDA C++ is not a memory-safe language. Subtle memory bugs can go undiscovered during development, especially without proper debugging tools.

NVIDIA Compute Sanitizer is a tool that helps developers find bugs in their programs. NVIDIA CUDA 13.1 introduces a new compiler option designed to aid Compute Sanitizer’s “memcheck” tool. The new compiler option gives developers better bug coverage and faster execution.

If you don’t know what Compute Sanitizer is, keep reading to learn how it can help you avoid frustration. If you already use Compute Sanitizer, keep reading to learn how our new compiler analysis can help uncover some stubborn memory errors in your code.

Uncovering bugs with Compute Sanitizer

Let’s dive in and use Compute Sanitizer to uncover a bug in the following program, an adaptation of an example from a previous blog post on Compute Sanitizer. Can you spot the logical error in the code below? Hint: it’s the classic off-by-one error. We’ve all been there, more times than we’d care to admit. In our code example, we not only launch scaleArray with too many threads, but we also use the wrong condition in scaleArray’s if-condition. The result is that thread 512 will access array[512], which is out of bounds.

#include <cstdio>
#include <cassert>

__global__ void scaleArray(float* array, size_t N, float value) {
  int threadGlobalID    = threadIdx.x + blockIdx.x * blockDim.x;
  if (threadGlobalID <= N) {
    array[threadGlobalID] = array[threadGlobalID]*value;
  }
}

int main() {
  float* array = nullptr;
  float* buffer = nullptr;
  const size_t N = 512;

  // Allocate N float-size elements, visible to both CPU and GPU
  cudaMallocManaged(&array, N*sizeof(float));  
  //cudaMallocManaged(&buffer, N*sizeof(float)); 

  for (int i=0; i<N; i++) {
    array[i] = 1.0f;
  }

  printf("Before: Array 0, 1 .. N-1: %f %f %f\n", array[0], array[1], array[N-1]);

  // Launch with too many threads; combined with the off-by-one
  // if-condition, thread 512 reads and writes array[512].
  scaleArray<<<1, 1024>>>(array, N, 3.0);
  cudaDeviceSynchronize();

  printf("After : Array 0, 1 .. N-1: %f %f %f\n", array[0], array[1], array[N-1]);
  assert(array[N/2] == 3.0); // Check that it has worked

  cudaFree(array);
  cudaFree(buffer);
  exit(0);
}

Let’s see if this bug is observable when we compile and run the program:

$ /usr/local/cuda-13.1/bin/nvcc -o example1 -arch sm_86 example1.cu
$ ./example1
Before: Array 0, 1 .. N-1: 1.000000 1.000000 1.000000
After : Array 0, 1 .. N-1: 3.000000 3.000000 3.000000

On our test machine, this program runs without issue. Bugs like the one in our example can easily go unnoticed during development, but manifest later as data corruption, segmentation faults, or even code exploits.

Now, let’s run the program through Compute Sanitizer to see if it uncovers the problem.

$ compute-sanitizer ./example1

========= COMPUTE-SANITIZER
Before: Array 0, 1 .. N-1: 1.000000 1.000000 1.000000
========= Invalid __global__ read of size 4 bytes
=========     at scaleArray(float *, unsigned long, float)+0xb0
=========     by thread (512,0,0) in block (0,0,0)
=========     Access to 0x7f7754000800 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f7754000000 of size 2,048 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x8deb] in example1
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: main [0x8df0] in example1
========= 
After : Array 0, 1 .. N-1: 1.000000 1.000000 1.000000
example1: example1.cu:27: int main(): Assertion `array[N/2] == 3.0' failed.
========= Error: process did not terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 2 errors

Hurray, it does! However, Compute Sanitizer, like all commercially available memory sanitizers, can have false negatives. That is, it may miss some legitimate memory safety errors. Let’s change the example above by uncommenting the second call to cudaMallocManaged. On our test machine, after recompiling and running the program through compute-sanitizer, we now get this message:

$ compute-sanitizer ./example1

========= COMPUTE-SANITIZER
Before: Array 0, 1 .. N-1: 1.000000 1.000000 1.000000
After : Array 0, 1 .. N-1: 3.000000 3.000000 3.000000
========= ERROR SUMMARY: 0 errors

No errors? What happened? The buffer overrun still exists, but the tool no longer reports the error. How did that minor change mask the buffer overrun?

The short answer is that Compute Sanitizer’s memcheck tool ensures that memory accesses go to valid memory address ranges, i.e., addresses that have been allocated and not freed. During our test, the CUDA runtime’s memory allocator placed array and buffer contiguously in memory. That is, buffer immediately follows array, so an overrun from array into buffer still accesses valid memory. Note that this shortcoming also applies to GCC’s and Clang’s “address sanitizer” tools.

Let’s understand why Compute Sanitizer operates the way it does. Until now, Compute Sanitizer has relied solely on binary instrumentation, injecting instrumentation code around memory and synchronization instructions in an application so that the tool can find unsound behavior at run time. We’re simplifying the operation, but we can think of the tool as inserting a fancy assertion before each memory reference that checks whether the address is within a currently allocated object:

__global__ void scaleArray(float* array, size_t N, float value) {
  int threadGlobalID    = threadIdx.x + blockIdx.x * blockDim.x;
  if (threadGlobalID <= N) {
    assert(isAllocated(&array[threadGlobalID])); // Inserted instrumentation.
    float tmp = array[threadGlobalID]*value;
    assert(isAllocated(&array[threadGlobalID])); // Inserted instrumentation.
    array[threadGlobalID] = tmp;
  }
}

After Compute Sanitizer inserts those assertions, it runs the application to see whether any of them fire at run time.

Compute Sanitizer didn’t have the luxury of statically analyzing the program, in part because it operated only on raw binaries, which are notoriously difficult to analyze. Without analysis, it could only instrument each memory reference independently, in a peephole manner. In our running example, even though some threads overflow from array into buffer, peephole instrumentation misses the illegal references because buffer is allocated. Therefore, ‘isAllocated(&array[threadGlobalID])’ is true, and the assertion doesn’t trigger.

There is a Compute Sanitizer option, ‘--padding’, that adds invalid memory regions between allocations, allowing Compute Sanitizer to detect the overrun in this example. But that option is not on by default because it can significantly increase memory usage. Moreover, the option cannot pad all memory allocations, including shared memory and global variable allocations.

We can do better with compile-time instrumentation.

Improving coverage with a compiler analysis

Starting in CUDA 13.1, Compute Sanitizer can leverage compile-time analysis and instrumentation to improve coverage. The result is fewer false negatives (i.e., bugs we miss) without adding false positives (i.e., non-bugs the tool misidentifies as bugs).

The goal of our analysis is to eagerly convert CUDA pointers to so-called “fat pointers,” which bundle a pointer together with the pointer’s base and bounds. Pointer arithmetic on a fat pointer modifies only the pointer component, leaving the base and bounds intact. In this way, we stand a much better chance of catching an overflow, even when it runs into an adjacent memory allocation. The technique is too involved to describe in this blog, but you can read all about it here.

Ignoring compilation time, the overhead of sanitization should be lower as well, in some cases by nearly an order of magnitude. That said, one consideration of compile-time instrumentation is that it requires you to spend time recompiling the code.

Sanitizing with compile-time instrumentation is a two-step process:

  1. Compile your program with the -fdevice-sanitize=memcheck flag. This step adds instrumentation code that contains the logic to verify your program at runtime. For example: nvcc -fdevice-sanitize=memcheck -arch sm_86 -o example1 example1.cu.
  2. Run the program under the Compute Sanitizer runtime. This step looks for memory safety issues at runtime. For example: compute-sanitizer ./example1.

While compile-time instrumentation performs a static program analysis, it is important to note that sanitizing is a dynamic process. You’ll only catch bugs if the application you’re testing happens to execute the buggy portion.

Let’s run the example program above that earlier eluded Compute Sanitizer, but this time after recompiling with compile-time instrumentation. First, let’s compile the program.

nvcc -o example1 -arch sm_86 -fdevice-sanitize=memcheck example1.cu

Now let’s run the program with Compute Sanitizer.

compute-sanitizer ./example1
========= COMPUTE-SANITIZER
Before: Array 0, 1 .. N-1: 1.000000 1.000000 1.000000
========= Invalid __global__ read of size 4 bytes
=========     at scaleArray(float *, unsigned long, float)+0x150
=========     by thread (512,0,0) in block (0,0,0)
=========     Access to 0x7b3566000800 is out of bounds
=========     and is inside the nearest allocation at 0x7b3566000800 of size 2,048 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: main [0x8e01] in example1
========= 
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame: main [0x8e06] in example1
========= 
After : Array 0, 1 .. N-1: 1.000000 1.000000 1.000000
example1: example1.cu:27: int main(): Assertion `array[N/2] == 3.0' failed.
========= Error: process did not terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 2 errors

This time we caught the bug. The compiler analysis constrained all accesses through the array pointer to the array object the kernel was launched with, thus preventing overruns into the buffer object.

Caveats and sharp edges to understand up front

There are some gotchas to be aware of when using compiler-based instrumentation.

Code compiled with “-fdevice-sanitize” is for debugging purposes only and is not fit for deployment. Your kernels will bloat, use additional registers, and may even use some stack. The additional register usage may prevent some kernels from launching and lead to the dreaded message, “too many resources requested for launch.” To address this issue, consider explicitly limiting resources via __launch_bounds__ annotations or the maxrregcount compiler flag.

It is much less likely (we haven’t seen it yet), but possible, that in programs with deep call stacks the instrumentation code’s extra stack usage overflows the stack.

Compiling your code in debug mode with -G will exacerbate all of the above issues. Our suggestion is to use -lineinfo so that Compute Sanitizer can provide accurate attribution, but to avoid debug options otherwise.

Undefined behavior can lead to unexpected code, and false positives and negatives

Undefined behavior (UB) is not a problem unique to compile-time instrumentation, and it’s a contentious topic, so don’t shoot the messengers: if there is UB in your code, the compiler absolves itself of any responsibility to generate the code you expect. Many compiler writers would sleep soundly if their passes turned undefined behavior into the result, “67.” Unfortunately for CUDA, it is all too easy to write code with undefined behavior in it.

We’ve seen cases where the compiler elided memory operations because the programmer constructed pointers via undefined behavior. One recent example we’ve seen involves a “false positive” caused by pointer arithmetic that first undershoots its object, only to be brought back in bounds before dereferencing. Computing an out-of-bounds pointer, even if you never dereference it, is undefined behavior. Even if your application appears to work, it is in your best interest to remove undefined behavior before the next compiler update gets “creative” with your code.

Race conditions may cause UB

Race conditions technically belong in the previous section because, in CUDA C++, race conditions are UB, but they are so prevalent in CUDA that they deserve their own section. Without proper synchronization, the compiler doesn’t have to preserve your expected thread and memory orderings.

When we instrument code by adding a bunch of extra logic and memory requests under the covers, we alter thread schedules. If your code has latent race conditions, then depending on the nature of the race, there’s a solid chance that -fdevice-sanitize=memcheck will cause your application to behave unexpectedly. If you notice your program hanging, crashing, or corrupting memory, please check your code with the racecheck tool in Compute Sanitizer.

HMM memory allocations are not supported

HMM memory is not currently supported. Our tool tracks only allocations performed via the CUDA runtime. If your application uses HMM, our new tool will likely generate many false positives. We hope to fix this in a future release.

Conclusion

We want to conclude with two takeaway messages. First, we urge CUDA developers to run their applications through Compute Sanitizer. CUDA C++ puts a lot of responsibility in a developer’s hands. We continue to find bugs in professionally authored code using Compute Sanitizer. Even if your code “works on your machine,” it may still have a bug or two.

Second, our compile-time approach significantly improves Compute Sanitizer’s bug coverage and runtime performance. If you have the luxury of recompiling your codebase, please give it a spin.


