Compression is a common technique for reducing storage costs and speeding up input/output transfer times across databases, data-center communications, high-performance computing, deep learning, and more. But decompressing that data often introduces latency and consumes valuable compute resources, slowing overall performance.
To address these challenges, NVIDIA introduced the hardware Decompression Engine (DE) in the NVIDIA Blackwell architecture—and paired it with the nvCOMP library. Together, they offload decompression from general-purpose compute, accelerate widely used formats such as Snappy, and make adoption seamless.
This post walks through how the DE and nvCOMP work, guidelines for using them, and the performance advantages they unlock for data-intensive workloads.
How the Decompression Engine works
The new DE in the Blackwell architecture is a fixed-function hardware block designed to accelerate decompression of Snappy, LZ4, and Deflate-based streams. By handling decompression in hardware, the DE frees valuable streaming multiprocessor (SM) resources for compute rather than burning cycles on data movement.
Integrated as part of the copy engine, the DE eliminates the need for sequential host-to-device copies followed by software decompression. Instead, compressed data can be transferred directly across PCIe or C2C and decompressed in transit, reducing a major I/O bottleneck.
Beyond raw throughput, the DE enables true concurrency of data movement and compute. Multi-stream workloads can issue decompression operations in parallel with SM kernels, keeping the GPU fully utilized. In practice, this means data-intensive applications such as training LLMs, analyzing massive genomics datasets, or running HPC simulations can keep pace with the bandwidth of next-generation Blackwell GPUs without stalling on I/O.
The advantages of nvCOMP’s GPU-accelerated compression
The NVIDIA nvCOMP library provides GPU-accelerated compression and decompression routines. It supports a wide selection of standard formats, along with formats that NVIDIA has optimized for the best possible GPU performance.
For standard formats, CPUs and fixed-function hardware frequently hold architectural advantages over the GPU because these formats expose only limited parallelism. The Decompression Engine is our solution to this problem for a variety of workloads. The following sections discuss in more detail how to leverage nvCOMP to use the DE.
How to use DE and nvCOMP
It’s best for developers to leverage the DE through the nvCOMP APIs. Because the DE is only available on select GPUs (currently the B200, B300, GB200, and GB300), using nvCOMP lets developers write portable code that scales and works across GPUs as the DE footprint evolves over time. When the DE is available, nvCOMP will use it without any changes to user code. If not, nvCOMP falls back to its accelerated SM-based implementations.
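The fallback is transparent, but an application that wants to report which path will be taken can probe the driver directly. The sketch below assumes the device-attribute introduced alongside DE support in recent CUDA releases (`CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_ALGORITHM_MASK`); nvCOMP performs an equivalent check internally, so this is illustrative rather than required:

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // A non-zero mask means the device has a hardware Decompression
    // Engine; the set bits indicate which algorithms it accelerates.
    int algo_mask = 0;
    cuDeviceGetAttribute(&algo_mask,
                         CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_ALGORITHM_MASK,
                         dev);
    printf(algo_mask ? "DE available\n"
                     : "No DE: nvCOMP will use SM-based decompression\n");
    return 0;
}
```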
There are a few things you need to do to ensure this behavior on DE-enabled GPUs. nvCOMP generally allows input and output buffers of any type that is accessible to the device. The DE has stricter requirements: if your buffers don’t meet them, nvCOMP will instead execute the decompression on the SMs. See Table 1 for an overview of the allowed allocation types and their intended usages.
| Allocation API | Description | Location |
| --- | --- | --- |
| cudaMalloc | Standard device-only allocation | Device |
| cudaMallocFromPoolAsync | Easy-to-use pool-based allocations with more control | Host/device |
| cuMemCreate | Low-level control of host/device allocations | Host/device |

*Table 1. Allocation types supported for DE decompression and their intended usages*
cudaMalloc allocations can be used as normal for device-to-device decompression. Host-to-device and even host-to-host decompression is possible with cudaMallocFromPoolAsync or cuMemCreate, but care must be taken to set up the allocators properly.
The following sections provide worked examples of how to use these different allocators. Note that in both cases, the only difference from standard use of these APIs is the addition of the cudaMemPoolCreateUsageHwDecompress and CU_MEM_CREATE_USAGE_HW_DECOMPRESS flags. In both examples, the allocations are placed on the first CPU NUMA node.
Using cudaMallocFromPoolAsync
The code example below shows how to create a pinned host memory pool with the cudaMemPoolCreateUsageHwDecompress flag, enabling allocations compatible with the DE.
// Describe a pinned host (NUMA node 0) memory pool whose allocations
// are usable by the hardware Decompression Engine
cudaMemPoolProps props = {};
props.location.type = cudaMemLocationTypeHostNuma;
props.location.id = 0;
props.allocType = cudaMemAllocationTypePinned;
props.usage = cudaMemPoolCreateUsageHwDecompress;

cudaMemPool_t mem_pool;
CUDA_CHECK(cudaMemPoolCreate(&mem_pool, &props));

// Stream-ordered allocation of 1 KiB from the DE-compatible pool
char* mem_pool_ptr;
CUDA_CHECK(cudaMallocFromPoolAsync(&mem_pool_ptr, 1024, mem_pool, stream));
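Teardown mirrors the setup. A minimal sketch, assuming the `mem_pool_ptr`, `mem_pool`, and `stream` from the example above:

```cpp
// Return the allocation to the pool (stream-ordered free) ...
CUDA_CHECK(cudaFreeAsync(mem_pool_ptr, stream));
CUDA_CHECK(cudaStreamSynchronize(stream));
// ... then release the pool and its backing memory.
CUDA_CHECK(cudaMemPoolDestroy(mem_pool));
```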
Using cuMemCreate
This example demonstrates how to use the low-level CUDA driver API (cuMemCreate) to allocate pinned host memory with the CU_MEM_CREATE_USAGE_HW_DECOMPRESS flag. It ensures the buffer is compatible with the DE.
CUdeviceptr mem_create_ptr;
CUmemGenericAllocationHandle allocHandle;

// Describe a pinned host (NUMA node 0) allocation usable by the DE
CUmemAllocationProp props = {};
props.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
props.location.id = 0;
props.type = CU_MEM_ALLOCATION_TYPE_PINNED;
props.allocFlags.usage = CU_MEM_CREATE_USAGE_HW_DECOMPRESS;

size_t granularity;
CU_CHECK(cuMemGetAllocationGranularity(&granularity, &props, CU_MEM_ALLOC_GRANULARITY_MINIMUM));
// Create the allocation handle
CU_CHECK(cuMemCreate(&allocHandle, granularity, &props, 0));
// Reserve virtual address space
CU_CHECK(cuMemAddressReserve(&mem_create_ptr, granularity, 0, 0, 0));
// Map the physical memory to the virtual address
CU_CHECK(cuMemMap(mem_create_ptr, granularity, 0, allocHandle, 0));
// Grant the device read/write access to the mapping; without this step,
// device-side accesses to the buffer will fault
CUmemAccessDesc accessDesc = {};
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = 0;  // device ordinal
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CU_CHECK(cuMemSetAccess(mem_create_ptr, granularity, &accessDesc, 1));
Best practices for buffer batching
For best performance, the batch of buffers used for decompression (inputs, outputs, and sizes) should be pointers offset into the same underlying allocations. If the batch instead spans many separate allocations, host driver launch overhead can be significant.
uint8_t* d_decompressed_buffer;
CUDA_CHECK(cudaMalloc(&d_decompressed_buffer, total_decompressed_size));

// Create a pinned host array for the device decompression pointers
uint8_t** h_d_decompressed_ptrs;
CUDA_CHECK(cudaHostAlloc(&h_d_decompressed_ptrs, actual_num_buffers * sizeof(uint8_t*), cudaHostAllocDefault));

// Fill the pinned host pointer array with offsets into the single device allocation
size_t decompressed_offset = 0;
for (int i = 0; i < actual_num_buffers; ++i) {
    h_d_decompressed_ptrs[i] = d_decompressed_buffer + decompressed_offset;
    decompressed_offset += decompressed_sizes[i];  // expected decompressed size of chunk i
}
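With the pointer and size arrays prepared, the batch can be handed to nvCOMP’s low-level batched interface in a single call. The sketch below uses Snappy; the scratch buffer, size arrays, and status array are assumed to have been set up as usual for nvCOMP’s batched API, the variable names are illustrative, and the exact signature may differ between nvCOMP versions:

```cpp
// Decompress the whole batch in one call; nvCOMP routes this to the DE
// when the device and buffers qualify, and to the SMs otherwise.
nvcompStatus_t status = nvcompBatchedSnappyDecompressAsync(
    d_compressed_ptrs,                    // input pointers (offsets into one allocation)
    d_compressed_bytes,                   // compressed size of each chunk
    d_uncompressed_bytes,                 // capacity of each output buffer
    d_actual_uncompressed_bytes,          // actual decompressed sizes (written back)
    actual_num_buffers,                   // batch size
    d_temp_ptr, temp_bytes,               // scratch space from the temp-size query
    (void* const*)h_d_decompressed_ptrs,  // output pointers built above
    d_statuses,                           // per-chunk status codes
    stream);
```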
Note that due to synchronization requirements associated with the DE, nvCOMP’s asynchronous APIs will synchronize with the calling stream. In general, nvCOMP will still return before the operation finishes, so when decompressing to the host you’ll still need to synchronize the calling stream again before using the results. For device-side access, the decompressed result is available in normal stream order.
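In practice, host-destination decompression therefore ends with one more synchronization before the CPU reads the output (a sketch, assuming the `stream` used for the decompress call):

```cpp
// Wait for the stream (and any DE work enqueued on it) to finish
// before the host touches the decompressed bytes.
CUDA_CHECK(cudaStreamSynchronize(stream));
// Host-visible output buffers are now safe to read.
```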
On B200, if any buffer is larger than 4 MB, nvCOMP will fall back to an SM-based implementation. This limit may change in the future and can be queried with the following code:
int max_supported_size = 0;
CU_CHECK(cuDeviceGetAttribute(&max_supported_size,
                              CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_MAXIMUM_LENGTH,
                              device_id));
How SM performance compares to DE
The DE provides faster decompression while freeing the SMs for other work. The DE has dozens of execution units, compared with the thousands of warps available on the SMs. Each DE execution unit is much faster than an SM at decompression, but in some workloads SM throughput approaches the DE’s once the SM resources are fully saturated. Either the SMs or the DE can consume pinned host data as input, enabling zero-copy decompression.
The following figure demonstrates SM versus DE performance on the Silesia benchmark for the LZ4, Deflate, and Snappy algorithms. Note that Snappy is newly optimized in nvCOMP 5.0, and further software optimization opportunities remain for Deflate and LZ4.
The performance measurement is done for 64 KiB and 512 KiB chunk sizes using “small” and “large” datasets. The large dataset is the full Silesia corpus, while the small dataset is the first ~50 MB of Silesia.tar (available here).


Get started
The Decompression Engine in Blackwell makes it much easier to address one of the biggest challenges in data-heavy workloads: fast, efficient decompression. By moving this work to dedicated hardware, applications not only see faster results but also free up GPU compute for other tasks.
With nvCOMP handling the integration automatically, developers can take advantage of these improvements without changing their code, resulting in smoother pipelines and higher performance.
To get started with these new features, explore the following resources:
