In its largest advancement since the NVIDIA CUDA platform was introduced in 2006, CUDA 13.1 is launching NVIDIA CUDA Tile. This exciting innovation introduces a virtual instruction set for tile-based parallel programming, focusing on the ability to write algorithms at a higher level and abstract away the details of specialized hardware, such as tensor cores.
Why tile programming for GPUs?
CUDA exposes a single-instruction, multiple-thread (SIMT) hardware and programming model for developers. This requires (and enables) you to exercise fine-grained control over how your code is executed, with maximum flexibility and specificity. However, it can also require considerable effort to write code that performs well, especially across multiple GPU architectures.
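To make that concrete, here is a minimal SIMT-style vector addition sketch in Python using Numba (Numba is used here purely for illustration and is not part of CUDA Tile). Notice how the code must spell out the launch geometry, per-thread indexing, and bounds checking itself:

```python
# SIMT sketch with Numba (illustrative only; not part of CUDA Tile).
# Every thread computes its own global index and handles one element.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # this thread's global index
    if i < out.size:          # guard threads that fall past the array end
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)  # explicit launch geometry
```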
There are many libraries to help developers extract performance, such as NVIDIA CUDA-X and NVIDIA CUTLASS. CUDA Tile introduces a new way to program GPUs at a higher level than SIMT.
With the evolution of computational workloads, especially in AI, tensors have become a fundamental data type. NVIDIA has developed specialized hardware to operate on tensors, such as NVIDIA Tensor Cores (TC) and NVIDIA Tensor Memory Accelerators (TMA), which are now integral to every new GPU architecture.
With more complex hardware, more software is required to help harness these capabilities. CUDA Tile abstracts away tensor cores and their programming models so that code using CUDA Tile is compatible with current and future tensor core architectures.
Tile-based programming lets you express your algorithm by specifying chunks of data, or tiles, and then defining the computations performed on those tiles. You don’t have to specify how your algorithm is executed at an element-by-element level; the compiler and runtime handle that for you.
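As a rough sketch of the idea in plain NumPy (not the cuTile API), a tiled matrix multiply walks over TILE × TILE blocks and expresses each step as a whole-tile operation, with no per-element code:

```python
import numpy as np

TILE = 64  # tile edge length; assumed to divide the matrix size evenly

def tiled_matmul(A, B):
    """Multiply square matrices tile by tile: every step is a whole-tile
    multiply-accumulate, never an individual element update."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            for k in range(0, n, TILE):
                # one tile-level operation: multiply two tiles, accumulate
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)
```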
Figure 1 shows the conceptual differences between the tile model we’re introducing with CUDA Tile and the CUDA SIMT model.


This programming paradigm is common in languages such as Python, where libraries like NumPy let you define data types like matrices, then specify and execute bulk operations with simple code. Under the covers, the right things happen, and your computations proceed completely transparently to you.
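For instance, a whole-matrix product in NumPy is a single bulk operation; the loop order, vectorization, and dispatch to an optimized BLAS routine all happen out of sight:

```python
import numpy as np

A = np.random.rand(1024, 1024)
B = np.random.rand(1024, 1024)

# One bulk operation on whole matrices. How it executes (loop order,
# SIMD, threading in the underlying BLAS library) is invisible here.
C = A @ B
```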
The foundation of CUDA Tile is CUDA Tile IR (intermediate representation). CUDA Tile IR introduces a virtual instruction set that enables native programming of the hardware as tile operations. Developers can write higher-level code that’s efficiently executed across multiple generations of GPUs with minimal changes.
While NVIDIA Parallel Thread Execution (PTX) ensures portability for SIMT programs, CUDA Tile IR extends the CUDA platform with native support for tile-based programs. Developers focus on partitioning their data-parallel programs into tiles and tile blocks, letting CUDA Tile IR handle the mapping onto hardware resources such as threads, the memory hierarchy, and tensor cores.
By raising the level of abstraction, CUDA Tile IR enables users to build higher-level hardware-specific compilers, frameworks, and domain-specific languages (DSLs) for NVIDIA hardware. CUDA Tile IR is to tile programming as PTX is to SIMT programming.
One thing to point out is that this isn’t an either/or situation. Tile programming on GPUs is another approach to writing GPU code, but you don’t have to choose between SIMT and tile programming; they coexist. When you need SIMT, you write your kernels as you always have. When you want to operate using tensor cores, you write tile kernels.
Figure 2 shows a high-level diagram of how CUDA Tile fits into a representative software stack, and how the tile path exists as a separate but complementary path to the existing SIMT path.


How developers can use CUDA Tile to write GPU applications
CUDA Tile IR is one layer beneath where the overwhelming majority of programmers will interface with tile programming. Unless you’re writing a compiler or library, you most likely won’t need to concern yourself with the details of the CUDA Tile IR software.
- NVIDIA cuTile Python: Most developers will interface with CUDA tile programming through software like NVIDIA cuTile Python, an NVIDIA Python implementation that uses CUDA Tile IR as the back end. We have a blog post that explains how to use cuTile Python, with links to sample code and documentation.
- CUDA Tile IR: For developers looking to build their own DSL, compiler, or library, CUDA Tile IR is where you’ll interface with CUDA Tile. The CUDA Tile IR documentation and specification include information on the CUDA Tile IR programming abstractions, syntax, and semantics. If you’re writing a tool, compiler, or library that currently targets PTX, you can adapt your software to also target CUDA Tile IR.
How to get the CUDA Tile software
CUDA Tile launched with CUDA 13.1. All the information, including links to documentation, GitHub repos, and sample code, is on our CUDA Tile page.
