NVIDIA CUDA 13.1 introduces the biggest and most comprehensive update to the CUDA platform since it was introduced nearly 20 years ago.
In this release, you'll find new features and updates for improving performance and driving accelerated computing, including:
- The launch of NVIDIA CUDA Tile, our tile-based programming model for abstracting away specialized hardware, including tensor cores.
- Runtime API exposure of green contexts.
- Emulation for double and single precisions in NVIDIA cuBLAS.
- A completely rewritten CUDA programming guide, designed for both novice and advanced CUDA programmers.
CUDA Tile programming
To help you create software for current and future GPUs, NVIDIA CUDA 13.1 is launching CUDA Tile, which enables you to write GPU kernels at a layer above SIMT. In SIMT programming today, you specify kernels by partitioning data and defining each thread's path of execution. With CUDA Tile, you can move your code up a layer and specify chunks of data called tiles. You specify the mathematical operations to be performed on those tiles, and the compiler and runtime determine the best way to launch that work onto individual threads. The tile model abstracts away the details of using specialized hardware such as tensor cores, and your tile code will be compatible with future GPU architectures.
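For contrast, here is a minimal conventional SIMT kernel (not cuTile code): every thread computes its own global index and bounds check, and the caller picks the block and grid decomposition by hand. This is the per-thread bookkeeping the tile model lifts you above.

// Conventional SIMT kernel: each thread handles one element and must
// compute its own global index and guard against out-of-bounds access.
__global__ void vector_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// The caller also chooses the thread-block decomposition explicitly:
// vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);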
CUDA 13.1 is releasing two components for tile programming:
- CUDA Tile IR: A new virtual instruction set architecture (ISA) for programming NVIDIA GPUs.
- cuTile Python: A new domain-specific language (DSL) for authoring array- and tile-based kernels in Python.
In this first version of the software:
- CUDA Tile is supported on NVIDIA Blackwell (compute capability 10.x and 12.x) products only. Future versions of CUDA will add support for more architectures.
- We've focused our development efforts on tile programming for AI algorithms. In future releases of CUDA, we'll continue to add more features, functionality, and performance.
- In an upcoming CUDA release, we plan to introduce an implementation in C++.
Check out more details about CUDA Tile IR and cuTile Python.
CUDA software updates
Here are other important software updates included in this release of CUDA.
Runtime exposure of green contexts
Green contexts in CUDA are a lightweight alternative to traditional CUDA contexts, designed to provide developers with a mechanism for finer-grained spatial partitioning and resource provisioning on the GPU. They've been available in the driver API since CUDA 12.4, and starting with this release, green contexts are also available in the runtime API.
Green contexts enable you to define and manage distinct partitions of GPU resources, primarily streaming multiprocessors (SMs), and dedicate a specific set of SMs to a specific context. You can then launch CUDA kernels and manage streams that run only within the resources provisioned for that green context. A typical example is an application with latency-sensitive code that takes priority over all other GPU work. By allocating SM resources to a dedicated green context for this code, and the rest to another green context for everything else, you guarantee that SMs are available and ready for the latency-sensitive computation.
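For reference, below is a minimal sketch of the driver-API flow that has been available since CUDA 12.4; the SM count is an arbitrary placeholder and error handling is omitted, so consult the programming guide for the exact signatures and their new runtime API counterparts.

#include <cuda.h>

// Minimal sketch: provision a green context that owns at least 16 SMs,
// then create a stream whose work runs only on those SMs.
void create_green_context_example()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Query the device's SM resource and split off one group of >= 16 SMs.
    CUdevResource smResource, remaining, partition;
    cuDeviceGetDevResource(dev, &smResource, CU_DEV_RESOURCE_TYPE_SM);
    unsigned int numGroups = 1;
    cuDevSmResourceSplitByCount(&partition, &numGroups, &smResource,
                                &remaining, 0 /*useFlags*/, 16 /*minCount*/);

    // Describe the partition and create the green context over it.
    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &partition, 1);
    CUgreenCtx greenCtx;
    cuGreenCtxCreate(&greenCtx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Streams created from the green context use only the provisioned SMs.
    CUstream stream;
    cuGreenCtxStreamCreate(&stream, greenCtx, CU_STREAM_NON_BLOCKING, 0);
}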
CUDA 13.1 also introduces a more customizable split() API. Developers can construct SM partitions that previously required multiple API calls, and can configure work queues to reduce false dependencies between work submitted in different green contexts.
The CUDA programming guide has more information about these features and the runtime exposure of green contexts.
CUDA Multi-Process Service updates
CUDA 13.1 brings new features and functionality to Multi-Process Service (MPS). For complete information on these new features, see the MPS documentation. A few of the highlights include:
Memory locality optimization partition
Memory locality optimization partition (MLOPart) is a feature of some NVIDIA Blackwell (compute capability 10.0 and 10.3) and newer GPUs that lets users create specialized CUDA devices optimized for improving memory locality. MLOPart devices are derived from a single underlying GPU but present as multiple devices with fewer compute resources and less available memory. Compute capability 10.0 and 10.3 GPUs each have two partitions.
When using MLOPart on supported GPUs, each partition appears as a distinct CUDA device with associated compute and memory resources. Currently, MLOPart is only supported on NVIDIA B200 and NVIDIA B300 products; a future release of CUDA will add support for NVIDIA GB200 and NVIDIA GB300 products.
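Because each partition is exposed as its own CUDA device, standard device enumeration is enough to see it from an application. The following sketch (assuming MLOPart has already been enabled through MPS) simply lists each visible device with its SM count and memory capacity.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Each MLOPart partition shows up as a separate device with its own
    // (smaller) SM count and memory capacity.
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("Device %d: %s, %d SMs, %.1f GiB\n",
                    i, prop.name, prop.multiProcessorCount,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}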
Static streaming multiprocessor partitioning
As an alternative to the existing dynamic execution resource provisioning available in MPS, static streaming multiprocessor (SM) partitioning is a feature for NVIDIA Ampere architecture (compute capability 8.0) and newer GPUs that provides a way to create exclusive SM partitions for MPS clients.
This mode is enabled by launching the MPS control daemon with the -S or --static-partitioning flag, and its major purpose is to deliver deterministic resource allocation and improved isolation between MPS clients. The fundamental unit of partitioning is a "chunk," which varies in size with the GPU architecture: for instance, 8 SMs on Hopper (compute capability 9.0) and newer discrete GPUs.
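As an illustration, and assuming the flag is simply added to the usual daemon-mode invocation, starting the control daemon with static partitioning might look like the following; check the MPS documentation for the exact command.
nvidia-cuda-mps-control -d --static-partitioning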
Emulation for double and single precisions in cuBLAS
While not strictly a CUDA 13.1 update, the cuBLAS update in NVIDIA CUDA Toolkit 13.0 introduced new APIs and implementations for boosting the performance of double-precision (FP64) matrix multiplications (matmuls). This is achieved through floating-point (FP) emulation on Tensor Cores found in GPU architectures such as NVIDIA GB200 NVL72 and NVIDIA RTX PRO 6000 Blackwell Server Edition. For comprehensive information on GPU compatibility for both FP32 and FP64 emulation, refer to the cuBLAS documentation.
Developer tools are a vital part of the CUDA platform. This release delivers several innovations and feature enhancements, including:
CUDA Tile kernel profiling
NVIDIA Nsight Compute 2025.4 adds support for profiling CUDA Tile kernels. Updates include a new "Result Type" column on the summary page for denoting Tile vs. SIMT kernels. A new "Tile Statistics" section on the details page summarizes tile dimensions and utilization of important pipelines. The source page also supports mapping metrics to the high-level cuTile kernel source.


This Nsight Compute release also adds support for profiling CUDA graph nodes from device-launched graphs, along with source page navigation improvements with clickable label links for both compiler-generated and user-generated labels.
Compile-time patching
NVIDIA Compute Sanitizer 2025.4 adds support for NVIDIA CUDA Compiler (NVCC) compile-time patching through the -fdevice-sanitize=memcheck compiler flag. This patching enhances memory error detection and improves compute sanitizer performance.
Compile-time instrumentation integrates error detection directly into NVCC for faster runs while catching more subtle memory issues, such as illegal accesses between adjacent allocations, through advanced base-and-bounds analysis. This means you can debug memory problems without sacrificing speed, run more tests, and maintain productivity. Right now, only memcheck is supported.
To use this new feature, compile your code with the NVCC flag as follows:
nvcc -fdevice-sanitize=memcheck -o myapp myapp.cu
Then run your application with compute-sanitizer using the memcheck tool.
compute-sanitizer --tool memcheck myapp
For complete information on compile-time patching, refer to the compute-sanitizer documentation.
NVIDIA Nsight Systems
NVIDIA Nsight Systems 2025.6.1 releases concurrently with CUDA Toolkit 13.1, with several new tracing features, including:
- System-wide CUDA trace: --cuda-trace-scope enables tracing across process trees or the entire system.
- CUDA host function trace: Added trace support for CUDA Graph host function nodes and cudaLaunchHostFunc(), which executes on the host and blocks the stream.
- CUDA hardware trace: Hardware-based tracing is now the default when supported; use --trace=cuda-sw to revert to software mode, as shown in the example after this list.
- Green context timeline rows now show SM allocation in tooltips to help users understand GPU resource utilization.
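For example, with myapp standing in for your application binary, a profile that opts back into the software-based CUDA trace could be collected as follows.
nsys profile --trace=cuda-sw ./myapp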
Math libraries
New features across our core CUDA Toolkit math libraries include:
- NVIDIA cuBLAS: A new experimental Grouped GEMM API supports FP8 and BF16/FP16 on Blackwell GPUs. Grouped GEMMs for these data types, together with CUDA Graph support, provide a host-synchronization-free implementation with device-side shapes, for up to 4x speed-up over a multi-stream GEMM implementation in the MoE use case.
- NVIDIA cuSPARSE: A new sparse matrix-vector multiplication (SpMVOp) API with improved performance compared to the CsrMV API. This API supports the CSR format, 32-bit indices, double precision, and user-defined epilogues.
- NVIDIA cuFFT: A new set of APIs, called the cuFFT device API, provides host functions for querying or generating device function code and database metadata in a C++ header file. Designed for the cuFFTDx library, it facilitates the generation of cuFFTDx code blocks by querying cuFFT, which can be linked with the cuFFTDx application to improve performance.
Performance updates on the new Blackwell architectures are also available. Select updates for key APIs and their performance follow.
cuBLAS Blackwell performance
CUDA Toolkit 12.9 introduced block-scaled FP4 and FP8 matmuls on NVIDIA Blackwell. CUDA 13.1 adds performance improvements for these data types and BF16. Speedups on NVIDIA Blackwell and Hopper are shown in Figure 2.


cuSOLVER Blackwell performance
CUDA 13.1 continues to improve the batched SYEVD and GEEV eigendecomposition APIs, delivering further performance gains.
Batched SYEV (cusolverDnXsyevBatched) is a uniform batched version of cuSOLVER's SYEV routine, computing eigenvalues and eigenvectors for symmetric/Hermitian matrices, ideal for solving many small matrices in parallel.
Figure 3 shows tests on a batch size of 5,000 (24 to 256 rows) with about a 2x speedup on the NVIDIA RTX PRO 6000 Blackwell Server Edition compared to the NVIDIA L40S, correlating with expected memory throughput increases.


Figure 4 shows the performance speed-up of cusolverDnXgeev (GEEV), which computes eigenvalues and eigenvectors of a general (non-symmetric) dense matrix. GEEV is a hybrid CPU/GPU algorithm. A single CPU thread manages aggressive early deflation in the QR algorithm, while the GPU handles the rest. Relative performance speed-ups for matrix sizes from 1,024 to 32,768 are shown.


NVIDIA CUDA Core Compute Libraries
NVIDIA CUDA Core Compute Libraries (CCCL) features several innovations and enhancements for CUB.
Deterministic floating-point reductions
Due to the non-associativity of floating-point addition, cub::DeviceReduce historically only guaranteed bitwise-identical results run-to-run on the same GPU. This was implemented as a two-pass algorithm.
NVIDIA CCCL 3.1, part of CUDA 13.1, provides two additional floating-point determinism options so you can make trade-offs between determinism and performance.
- Not-guaranteed: Single-pass reduction using atomics. This isn't guaranteed to produce bitwise-identical results.
- GPU-to-GPU: Based on the reproducible reduction presented in Kate Clark's NVIDIA GTC 2024 talk. Results are always bitwise-identical.
The determinism option can be set through a flag, as shown in the following code.
// Pick your desired trade-off of performance and determinism
// auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
// auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);
// auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
cub::DeviceReduce::Sum(..., env);


More convenient single-phase CUB APIs
Nearly every CUB algorithm requires temporary storage for intermediate scratch space. Historically, users had to query and allocate the necessary temporary storage through a two-phase call pattern that's cumbersome and error-prone if arguments aren't passed identically between the two invocations.
CCCL 3.1 adds new overloads to some CUB algorithms that accept a memory resource, so you can skip the temp-storage query/allocate/free pattern.
Before (two-phase)
// Determine temporary storage size
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(d_temp_storage,
                              temp_storage_bytes,
                              d_input, ...);
// Allocate the required temporary storage
cudaMallocAsync(&d_temp_storage,
                temp_storage_bytes, stream);
// Run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage,
                              temp_storage_bytes,
                              d_input, ...);
// Free the temporary storage
cudaFreeAsync(d_temp_storage, stream);
After (single-phase)
// Pool mr uses cudaMallocAsync under the hood
cuda::device_memory_pool mr{cuda::devices[0]};
// Single call. Temp storage is handled by the pool.
cub::DeviceScan::ExclusiveSum(d_input,..., mr);
Learn more
The release of CUDA 13.1 brings many new features and ushers in a new era of GPU programming with CUDA Tile. Check out the CUDA Tile resources, download CUDA Toolkit 13.1, and get started today.
Acknowledgements
Thanks to the following NVIDIA contributors: Jake Hemstad, Becca Zandstein, Jackson Marusarz, Kyrylo Perelygin, and Myrto Papadopoulou.
