CUDA 13.2 Introduces Enhanced CUDA Tile Support and Recent Python Features

CUDA 13.2 arrives with a major update: NVIDIA CUDA Tile is now supported on devices of compute capability 8.X (NVIDIA Ampere and NVIDIA Ada), in addition to the 10.X and 12.X architectures (NVIDIA Blackwell). In an upcoming release of the CUDA Toolkit, all GPU architectures starting with Ampere will be fully supported. If you're using Ampere, Ada, or Blackwell GPU architectures, check out the cuTile Python Quickstart guide to get started with CUDA Tile.

This post explores the CUDA 13.2 release, which boosts developer productivity with a wide range of new Python additions, including profiling in CUDA Python and debugging Numba kernels. The math libraries provide expanded support for high-performance emulation, and the CUDA Core Compute Libraries (CCCL) continue to add both performance and feature improvements, providing C++ developers with a high-performance, modern interface to GPU programming.

cuTile Python

cuTile Python, the Python DSL expression of the CUDA Tile programming model, is releasing a number of feature enhancements. These include enhanced language support for the following:

  • Recursive functions
  • Closures with capture (nested functions, lambda functions)
  • Custom reduction and scan functions
  • Allowing assignments with type annotations
  • Enhanced array support for Array.slice to create a view on a subarray

We've also provided a simple installation path. The following pip install command installs cuTile Python and pulls in all the needed dependencies, without requiring a separate system-wide installation of the CUDA Toolkit.

pip install cuda-tile[tileiras]

Core enhancements

Core enhancements in CUDA 13.2 are detailed in this section.

memcpy with attributes

A previous release of CUDA (12.8) introduced batched memcpy APIs. These allow you to specify batches of memcopies to be executed with a single function call. You can also specify attributes to better control and optimize the memory transfers.

These APIs enable more control over your memory transfers. However, if you have only a single transfer and also want to use the attribute features, you need to call a batched API with a batch size of 1. This is a bit cumbersome.

To simplify this use case, two new API functions have been added: cudaMemcpyWithAttributesAsync and cudaMemcpy3DWithAttributesAsync. These functions let you take advantage of attributes in your memory calls without requiring the more involved batched interface.

And to simplify your programming, if you already use cudaMemcpyAsync for your transfers and want to use attributes, you can continue using cudaMemcpyAsync. It is overloaded with the same argument list as cudaMemcpyWithAttributesAsync.

Reduced local memory usage on Windows

Local memory (LMEM) on GPUs is allocated on a per-thread basis and used for register spilling, stack variables, and the like. Starting with CUDA 13.2 and CUDA driver R595, when running on Windows in the WDDM driver mode, LMEM usage has been significantly reduced. The effect of this change will be seen primarily in memory-constrained vGPU environments.

Query the properties of a memory pool

CUDA provides the ability to use memory pools for efficient memory management. CUDA 13.2 introduces an API to query the properties of a memory pool from the memory pool handle. These properties are obtained by calling cudaMemPoolGetAttribute with the appropriate flags.

One use case for this new feature is creating a memory pool of the same type as an existing memory pool. For example, when using CUDA Graphs, the cudaGraphAddMemAllocNode API accepts pool properties as a parameter. You can use the properties of a current memory pool to create a new pool with the same properties.

Windows compute drivers default to MCDM instead of TCC

On Microsoft Windows systems, starting with CUDA driver version R595, GPUs on compatible systems that previously started in TCC mode by default will now start in MCDM mode by default. This change should address compatibility issues on some systems where users would see a yellow bang on their TCC GPUs at startup due to an incompatibility with OS/system features. For users with a dependency on TCC, it's still available for now and can be enabled using nvidia-smi -dm 1 -g.

Going forward, we intend to progressively transition to MCDM, because it brings features that were previously reserved for GPUs in WDDM mode:

  • WSL2: MCDM GPUs will show up in WSL2 and be able to run CUDA in WSL
  • Containers: Native (and WSL) containers are supported
  • Advanced memory management APIs: cuMemCreate, cudaMallocAsync, and all their related APIs are now supported
  • RDMA through the same interface as WDDM RDMA, which was released in CUDA Toolkit 13.1
  • Memory oversubscription and trim notification

Because of some extra overhead in MCDM, submission latency is currently slightly higher than on TCC. We are actively working to bring it on par with TCC and native Linux (on both WDDM and MCDM) to ensure WDDM/MCDM becomes a suitable and future-proof driver model for all our GPUs on Windows.

CUDA_DISABLE_PERF_BOOST

CUDA Toolkit 13.2 and CUDA driver versions 580 and later add a new environment variable, CUDA_DISABLE_PERF_BOOST. It disables the default behavior of boosting the GPU to a higher power state when running CUDA applications. Setting this environment variable to 1 disables the boost, which may result in power savings when using features like NVENC/NVDEC.

CUDA Graphs polymorphic function to acquire graph node parameters

CUDA Graphs provide the ability to express a workflow of GPU operations, like kernel launches and memory copies, as a single unit rather than a series of individual commands. CUDA 13.2 adds a new polymorphic API function, cudaGraphNodeGetParams, that lets you obtain the parameters of a graph node. It is a companion to existing polymorphic functions like cudaGraphNodeSetParams, cudaGraphAddNode, and cudaGraphExecNodeSetParams.

Compilers

CUDA 13.2 brings new compiler updates, including support for new host compilers such as Visual Studio 2026, ARM C Language Extension support for gcc, and a single unified toolkit for Tegra and desktop GPUs, which reduces overhead for containers and libraries.

Embedded devices

Previously, CUDA 13.0 (and NVIDIA JetPack 7.0) introduced unified CUDA for Arm, streamlining development for Arm platforms by unifying the CUDA Toolkit across server-class and embedded devices such as NVIDIA Jetson Thor.

Starting with CUDA 13.2 (and the upcoming JetPack 7.2—stay tuned), the same Arm SBSA CUDA Toolkit can be used across all Arm targets. This release also supports NVIDIA Jetson Orin devices on the same CUDA SBSA toolkit. For developers, this means reduced duplication in CI pipelines, simplified container management, and elimination of subtle bugs and inconsistencies that previously came from juggling different SDKs.

CUDA 13.2 and JetPack 7.2 introduce NVIDIA Multi-Instance GPU (MIG) support, allowing the GPU integrated with Jetson Thor to be partitioned into two fully isolated instances, each with dedicated memory, cache, and compute resources. This capability is particularly useful for mixed-criticality applications, such as humanoid robotics, where developers can isolate safety-critical workloads (motor control and safety systems, for example) from noncritical processing tasks.

Without MIG, safety-critical and noncritical workloads running on the same GPU, such as low-latency motor control alongside heavier perception or language models, compete for shared resources. A bursty task with high memory bandwidth demand can steal capacity from safety-critical kernels, causing jitter and missed latency deadlines for control and safety systems.

With MIG, critical and noncritical workloads run on separate GPU instances, each with dedicated compute, memory, and bandwidth. This isolation delivers predictable latency and quality of service for control and critical tasks, while keeping the GPU highly utilized by concurrently running heavier perception or language workloads on the other instance.

Math libraries

CUDA 13.2 introduces improvements for math libraries including NVIDIA cuBLAS and NVIDIA cuSOLVER.

NVIDIA cuBLAS 

A new experimental API with Grouped GEMM now supports MXFP8 for NVIDIA Blackwell GPUs. Prior support (in CUDA 13.1) included FP8 and BF16/FP16 on Blackwell GPUs. Grouped GEMMs for these data types, with CUDA Graphs support, provide a host-synchronization-free implementation with device-side shapes, for speedups of up to 4x over a multistream GEMM implementation in the mixture of experts (MoE) use case.
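To make the "grouped" pattern concrete: a grouped GEMM issues many independent matrix products, of possibly different shapes, as a single call rather than one call (or stream) per product. The following plain-Python sketch illustrates only the semantics; it is not the cuBLAS interface, and the helper names are illustrative.

```python
# Semantic sketch of a grouped GEMM: many independent matrix products of
# possibly different shapes, conceptually issued as one call.
# Plain Python for illustration only; not the cuBLAS API.

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def grouped_gemm(groups):
    """Each group is an (A, B) pair; shapes may differ between groups."""
    return [matmul(a, b) for a, b in groups]

# Two groups with different shapes, as in an MoE dispatch
groups = [
    ([[1, 2], [3, 4]], [[5, 6], [7, 8]]),  # 2x2 @ 2x2
    ([[1, 0, 2]], [[1], [2], [3]]),        # 1x3 @ 3x1
]
results = grouped_gemm(groups)
print(results[0])  # [[19, 22], [43, 50]]
print(results[1])  # [[7]]
```

On the GPU, batching these products into one launch avoids per-GEMM host synchronization, which is where the reported speedups come from.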

NVIDIA cuSOLVER 

cuSOLVER APIs for FP64-emulated calculations have been introduced. This allows platforms with a high ratio of INT8-to-FP64 throughput to achieve significant performance gains, particularly for compute-intensive workloads. The benefits of emulation are most apparent in key APIs for QR, LU, and Cholesky factorizations. To learn more about the latest advances in emulation techniques from NVIDIA, see Unlocking Tensor Core Performance with Floating Point Emulation in cuBLAS.

Figure 1 shows the results of FP64-emulated DGEQRF, DGETRF, and DPOTRF on NVIDIA B200 systems. The performance benefits increase with matrix size and can reach up to 2x for QR, the most compute-intensive of the three operations, when matrix sizes approach 80K.

Graph showing FP64-emulated results for DGEQRF, DGETRF, and DPOTRF on NVIDIA B200 systems. As matrix sizes go from 20,000 to 80,000, the speedups go from slightly over 1x up to 2x, compared to the non-emulated functions.
Figure 1. FP64-emulated DGEQRF, DGETRF, and DPOTRF on NVIDIA B200 systems

Developer tools

Developer tools new to this release are detailed in this section.

NVIDIA Nsight Python

NVIDIA Nsight Python is a new kernel profiling interface that brings the power of NVIDIA profiling tools directly to Python developers. With this release, you can seamlessly profile CUDA kernels launched through Python frameworks across multiple configurations, directly from Python.

Using just a few decorators, users can automatically configure, profile, and plot kernel performance comparisons. Nsight Python also provides access to the performance data in common Python data structures for advanced analysis. Download Nsight Python from PyPI. You can also contribute to the NVIDIA/nsight-python GitHub repo and visit the NVIDIA Developer Forum with any questions or issues.

@nsight.analyze.plot("02_parameter_sweep.png")
@nsight.analyze.kernel(configs=sizes, runs=10)
def benchmark_matmul_sizes(n: int) -> None:
    """
    Benchmark matrix multiplication across different sizes.
    The 'n' parameter comes from the configs list.
    """
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")

    with nsight.annotate("matmul"):
        _ = a @ b

Numba-CUDA debugging

For the first time, debugging Numba-CUDA kernels running on a GPU is now possible with CUDA-GDB command-line debugging and NVIDIA Nsight Visual Studio Code Edition. Users can set breakpoints, step through statements, and inspect program state, as with host and native CUDA debuggers. This initial support has a limited feature set, and the team is actively looking for feedback to improve it. To learn more, check out the Numba-CUDA debugging documentation and reach out for help or feedback on the Developer Forum.

NVIDIA Nsight Tools updates

NVIDIA Nsight Compute 2026.1 features a new report clustering and merging tool, accessible from the File > Merge Reports menu. It helps users understand data from repeated experiments, separate profiling sessions, or multiprocess applications generating multiple reports.

Screenshot of Nsight Compute profile, illustrating the report clustering tool.
Figure 2. Nsight Compute report clustering tool

A new Register Dependency correlation window on the Source page helps users identify source line dependencies to quickly locate bottlenecks. The CUDA Graphs viewer tool window has been significantly improved to show graphs as they are built and profiled in the interactive profiling mode, and it visually correlates collected results to graph nodes. Nsight Compute is included in the CUDA Toolkit and is available as a standalone download.

NVIDIA Nsight Cloud includes updates to the Nsight Operator for Kubernetes, along with Nsight Streamer Kubernetes and Docker containers for accessing and viewing Nsight tool reports from within a cluster.

NVIDIA Nsight Copilot is a free AI-powered CUDA coding assistant that’s now available to everyone with an NVIDIA Developer account.

NVIDIA Nsight Systems 2026.1 includes:

  • PyTorch profiling improvements to display shape and training parameters for forward and backward extension modules
  • Support for Python 3.14 in the Python sampling feature
  • A new option to capture metrics of GPUDirect Storage DMA operations

CCCL

CUDA 13.2 ships with version 3.2 of CCCL. Highlights include new modern CUDA C++ runtime APIs and new optimized algorithms, including Top-K.

Modern CUDA C++ runtime

CCCL 3.2 introduces new idiomatic C++ interfaces for core CUDA runtime and driver functionality.

If you've written CUDA C++, you've likely built (or adopted) some form of convenience wrappers around today's C-style APIs such as cudaMalloc or cudaStreamCreate.

The new APIs added in CCCL 3.2 are meant to offer the productivity and safety benefits of C++ for core CUDA constructs, so you can spend less time reinventing wrappers and more time writing kernels and algorithms.

Highlights include:

  • New convenient vocabulary types for core CUDA concepts (cuda::stream, cuda::event, cuda::arch_traits)
  • Easier memory management with memory resources and cuda::buffer
  • More powerful and convenient kernel launch with cuda::launch

Example (vector add, revisited):

cuda::device_ref device = cuda::devices[0];
cuda::stream stream{device};
auto pool = cuda::device_default_memory_pool(device);

int num_elements = 1000;
auto A = cuda::make_buffer(stream, pool, num_elements, 1.0);
auto B = cuda::make_buffer(stream, pool, num_elements, 2.0);
auto C = cuda::make_buffer(stream, pool, num_elements, cuda::no_init);

constexpr int threads_per_block = 256;
auto config = cuda::distribute(num_elements);
auto kernel = [] __device__ (auto config, cuda::std::span A, 
                                            cuda::std::span B, 
                                            cuda::std::span C){
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size())
        C[tid] = A[tid] + B[tid];
};
cuda::launch(stream, config, kernel, config, A, B, C);

Try this example live on Compiler Explorer.

Stay tuned for a deeper dive into the design goals and intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.

New algorithms

Algorithms new to CUDA 13.2 are detailed in this section.

Top-K selection

CCCL 3.2 introduces cub::DeviceTopK (for example, cub::DeviceTopK::MaxKeys) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5x speedups over a full radix sort, and it can reduce memory consumption when you don't need sorted results.
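The key idea is that selecting K extremes is cheaper than ordering everything. As a plain-Python analogy (not the CUB API), this is the difference between heapq.nlargest and a full sort:

```python
# Top-K semantics: select the K largest elements without sorting the whole
# input. heapq keeps only a K-element heap, which is the same idea
# cub::DeviceTopK::MaxKeys exploits on the GPU. Python analogy only.
import heapq

data = [7, 42, 3, 19, 88, 5, 61, 24]
k = 3
top_k = heapq.nlargest(k, data)  # O(n log k) instead of O(n log n)
print(top_k)  # [88, 61, 42]
```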

Top‑K is an active area of ongoing work for CCCL. The roadmap includes planned segmented Top‑K as well as block‑scope and warp‑scope Top‑K variants. To learn more about what's planned and share your most important Top‑K use cases, see NVIDIA/cccl GitHub Issue #5673.

Graph showing normalized execution time of cub::DeviceTopK::MaxKeys for K=3, compared to radix sort. The graph shows values from 2^18 to 2^30 elements. The normalized time of the TopK function compared to radix sort goes from 45% to 20% (lower is better) of the radix sort walltime. 
Figure 3. Normalized execution time comparing the new cub::DeviceTopK::MaxKeys for K=3 to the common solution of performing a full radix sort

Fixed-size segmented reduction

CCCL 3.2 now provides a new cub::DeviceSegmentedReduce variant that accepts a uniform segment_size, eliminating offset iterator overhead in the common case where segments are fixed-size. This enables optimizations for both small segment sizes (up to 66x) and large segment sizes (up to 14x).

// New API accepts a fixed segment_size instead of per-segment begin/end offsets
cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,  
                                num_segments, segment_size);

In Figure 4, the new fixed-size variant shows significant speedups for both small and large segments compared to the existing implementation, which specifies begin and end offsets for each segment.

Normalized execution time comparing the new fixed-size segment overload of cub::DeviceSegmentedReduce to the existing implementation. The new fixed-size variant shows significant speed-up for both small segments (up to 66x) and large segments (up to 14x) compared to the existing implementation that specifies begin and end offsets for each segment.
Figure 4. Normalized execution time comparing the new fixed-size segment overload of cub::DeviceSegmentedReduce to the existing implementation
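Semantically, a fixed-size segmented reduction reduces each equally sized slice of the input independently; because every segment has the same length, no per-segment offset arrays are needed. A plain-Python sketch of the semantics (illustrative only, not the CUB interface):

```python
# Fixed-size segmented reduction: reduce each equally sized segment
# independently. With a uniform segment_size there is no need for
# per-segment begin/end offset arrays. Plain-Python sketch of the semantics.

def segmented_sum(values, segment_size):
    assert len(values) % segment_size == 0, "input must divide into segments"
    return [sum(values[i:i + segment_size])
            for i in range(0, len(values), segment_size)]

input_data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(segmented_sum(input_data, 3))  # [6, 15, 24]
```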

More new algorithms in CCCL 3.2

Segmented Scan: cub::DeviceSegmentedScan provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.

Binary Search: cub::DeviceFind::[Upper/LowerBound] performs a parallel search for multiple values in an ordered sequence.

Search: cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.
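The early-exit behavior is what distinguishes FindIf from a full-sequence search: it stops at the first match. In plain Python (a sketch of the semantics, not the CUB API; returning the sequence length for "not found" mirrors std::find_if's end iterator):

```python
# FindIf semantics: return the index of the first element satisfying a
# predicate, stopping as soon as a match is found (early exit). A plain-Python
# sketch of what cub::DeviceFind::FindIf computes in parallel on the GPU.

def find_if(seq, pred):
    return next((i for i, x in enumerate(seq) if pred(x)), len(seq))

data = [4, 8, 15, 16, 23, 42]
print(find_if(data, lambda x: x > 20))  # 4 (index of 23)
print(find_if(data, lambda x: x < 0))   # 6 (no match: len(data))
```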

CUDA Python

CuPy now supports CUDA 13.0 and 13.1, and wheels are available on PyPI for CUDA 12 and CUDA 13. This means it's easier than ever to install CuPy without a system-wide CUDA Toolkit.

pip install cupy-cuda12x
pip install cupy-cuda13x

CuPy now implements the CUDA Stream Protocol, enabling direct stream sharing with PyTorch, JAX, and other frameworks that support the protocol. This means zero-copy interoperability without manual pointer management.

# Share a CuPy stream with PyTorch
pytorch_stream = torch.cuda.ExternalStream(cupy_stream)

# Or import an external stream into CuPy
cupy_stream = cupy.cuda.Stream.from_external(pytorch_stream)

Support has been added for ml_dtypes.bfloat16, which brings native reduced-precision computation to CuPy; this type is commonly used in AI training and inference. Performance has improved in some core operations through fast-path optimizations for generalized ufuncs, array operators, and scalar handling. Support for multithreaded applications has also improved. CuPy arrays can now be viewed as cuda::std::mdspan objects through ndarray.mdspan, with control over 32-bit and 64-bit indexing. This gives users more control over arithmetic operations and performance.

cuda.core 0.6 introduces NVML bindings (cuda.bindings.nvml) for GPU monitoring and management, and new nvFatbin bindings (cuda.bindings.nvfatbin) for fat binary manipulation. The new cuda.core.system module provides Pythonic access to system information, such as device thermal monitoring and CPU/GPU affinity, built on top of NVML.

Support for constructing CUDA Graphs has graduated from the experimental namespace and is now available under the main cuda.core namespace. This lets developers capture sequences of operations and replay them with minimal overhead, and it supports advanced patterns such as conditional execution (if_cond and while_loop) and fork-join. The following code shows how the API works:

# Construct a graph by capturing operations
gb = device.create_graph_builder()
gb.begin_building()

# Capture kernel launches in the graph (not executed yet)
launch(gb, LaunchConfig(grid=256, block=256), kernel_a, data_ptr)
launch(gb, LaunchConfig(grid=256, block=256), kernel_b, data_ptr)
launch(gb, LaunchConfig(grid=256, block=256), kernel_c, data_ptr)

# Finalize and instantiate the graph
graph = gb.end_building().complete()

# Launch the graph into an existing CUDA Stream
graph.launch(stream)

For more information, see the cuda.core.GraphBuilder docs and examples.

Start with CUDA 13.2

CUDA 13.2 simplifies high-performance development by continuing to elevate Python as a first-class citizen and introducing productivity-focused language features that bridge the gap between ease of use and peak GPU performance.

Download the CUDA 13.2 Toolkit to get started.

Acknowledgments

Thanks to the following NVIDIA contributors: Jake Hemstad, Becca Zandstein, Jackson Marusarz, Mridula Prakash, Rekha Mukund, Daniel Rodriquez, Bo Dong, Andy Terrel, Raphael Boissel, and Rob Armstrong.


