The release of NVIDIA CUDA 13.1 introduces tile-based programming for GPUs, one of the most fundamental additions to GPU programming since CUDA was invented. Writing GPU tile kernels lets you express your algorithm at a higher level than the single-instruction, multiple-thread (SIMT) model, while the compiler and runtime handle the partitioning of work onto threads under the covers. Tile kernels also help abstract away special-purpose hardware like tensor cores, so you can write code that will be compatible with future GPU architectures. With the launch of NVIDIA cuTile Python, you can write tile kernels in Python.
What’s cuTile Python?
cuTile Python is an expression of the CUDA Tile programming model in Python, built on top of the CUDA Tile IR specification. It lets you write tile kernels in Python and express GPU kernels using a tile-based model, rather than or in addition to the single-instruction, multiple-thread (SIMT) model.
SIMT programming requires specifying the work of each GPU thread of execution. In principle, each thread can operate independently and execute a code path unique from any other thread. In practice, to use GPU hardware effectively, it's typical to program algorithms where each thread performs the same work on separate pieces of data.
SIMT enables maximum flexibility and specificity, but can also require more manual tuning to achieve top performance. The tile model abstracts away some of the hardware intricacies. You can focus on your algorithm at a higher level, while the NVIDIA CUDA compiler and runtime handle partitioning your tile algorithm into threads and launching them onto the GPU.
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs. In this model:
- Arrays are the primary data structure.
- Tiles are subsets of arrays that kernels operate on.
- Kernels are functions that are executed in parallel by blocks.
- Blocks are subsets of the GPU; operations on tiles are parallelized across each block.
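As a rough illustration of how these terms relate, the following is a plain-Python sketch (no GPU required, purely for intuition, and not cuTile API): an array is split into equally sized tiles, and one block is responsible for each tile.

import numpy as np

# Plain-Python sketch of the decomposition described above (not cuTile API).
array_size = 4096                       # the array is the primary data structure
tile_size = 16                          # a tile is a contiguous subset of the array
num_blocks = array_size // tile_size    # one block per tile -> 256 blocks

a = np.arange(array_size)

# Block `bid` would operate on this tile; within the block, the work on the
# tile's elements is parallelized for you by cuTile.
bid = 3
tile = a[bid * tile_size : (bid + 1) * tile_size]
print(num_blocks, tile.shape)           # 256 (16,)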
cuTile automates block-level parallelism and asynchrony, memory movement, and other low-level details of GPU programming. It will leverage the advanced capabilities of NVIDIA hardware (such as tensor cores, shared memory, and tensor memory accelerators) without requiring explicit programming. cuTile is portable across different NVIDIA GPU architectures, enabling you to use the latest hardware features without rewriting your code.
Who’s cuTile for?
cuTile is for general-purpose data-parallel GPU kernel authoring. Our efforts have been focused on optimizing cuTile for the kinds of computations typically encountered in AI/ML applications. We'll continue to evolve cuTile, adding functionality and performance features to expand the range of workloads it can optimize.
You may be asking why you'd use cuTile to write kernels when CUDA C++ or CUDA Python has worked well to date. We talk more about this in another post describing the CUDA tile model. The short answer is that as GPU hardware becomes more complex, we're providing an abstraction layer at a reasonable level so developers can focus more on algorithms and less on mapping an algorithm to specific hardware.
Writing tile programs lets you target tensor cores with code compatible with future GPU architectures. Just as Parallel Thread Execution (PTX) provides the virtual Instruction Set Architecture (ISA) that underlies the SIMT model for GPU programming, Tile IR provides the virtual ISA for tile-based programming. It enables higher-level algorithm expression, while the software and hardware transparently map that representation to tensor cores to deliver peak performance.
cuTile Python example
What does cuTile Python code look like? If you've learned CUDA C++, you most likely encountered the canonical vector addition kernel. Assuming the data has been copied from the host to the device, a vector add kernel in CUDA SIMT looks something like the following, which takes two vectors and adds them together elementwise to produce a third vector. This is one of the simplest CUDA kernels you can write.
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    /* calculate my thread index */
    int workIndex = threadIdx.x + blockIdx.x * blockDim.x;

    if (workIndex < vectorLength)
    {
        /* perform the vector addition */
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
In this kernel, each thread's work is explicitly specified, and the programmer, when launching the kernel, selects the number of blocks and threads for the launch.
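For comparison, here is a minimal sketch of what launching that SIMT kernel from Python involves. It uses CuPy's RawKernel interface, which is an assumption for illustration rather than part of this post's workflow; the point is that the programmer picks the thread-block size and computes the grid size by hand.

import cupy as cp
import numpy as np

# Compile the SIMT kernel above with CuPy (illustrative sketch, not the cuTile path).
vec_add_simt = cp.RawKernel(r'''
extern "C" __global__
void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockIdx.x * blockDim.x;
    if (workIndex < vectorLength)
    {
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
''', 'vecAdd')

n = 1 << 12
a = cp.random.uniform(-1, 1, n).astype(cp.float32)
b = cp.random.uniform(-1, 1, n).astype(cp.float32)
c = cp.zeros_like(a)

threads_per_block = 256                                     # chosen by the programmer
blocks = (n + threads_per_block - 1) // threads_per_block   # ceil(n / threads_per_block)
vec_add_simt((blocks,), (threads_per_block,), (a, b, c, np.int32(n)))

np.testing.assert_array_almost_equal(cp.asnumpy(c), cp.asnumpy(a + b))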
Now, let's look at the equivalent code written in cuTile Python. We don't need to specify what each thread does. We only need to break the data into tiles and specify the mathematical operations for each tile. Everything else is handled for us.
The cuTile Python kernel looks as follows:
import cuda.tile as ct

@ct.kernel
def vector_add(a, b, c, tile_size: ct.Constant[int]):
    # Get the 1D pid
    pid = ct.bid(0)

    # Load input tiles
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))

    # Perform elementwise addition
    result = a_tile + b_tile

    # Store result
    ct.store(c, index=(pid,), tile=result)
ct.bid(0) is the function that obtains the block ID along the (in this case) zeroth axis. It's akin to how SIMT kernel writers would reference blockIdx.x and threadIdx.x, for example. ct.load() is the function that loads a tile of data, with the requisite index and shape, from device memory. Once data is loaded into tiles, these tiles can be used in computations. When all of the computations are complete, ct.store() puts the tile data back into GPU device memory.
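Those three calls are enough to express many elementwise kernels. As a hedged illustration, the sketch below is a hypothetical variant (vector_multiply, not from the official samples) built only from the calls shown in this post; it assumes tiles overload * for elementwise multiplication the same way they overload +.

import cuda.tile as ct

# Hypothetical variant of the kernel above: elementwise multiply instead of add.
@ct.kernel
def vector_multiply(a, b, c, tile_size: ct.Constant[int]):
    pid = ct.bid(0)                                         # block ID along axis 0
    a_tile = ct.load(a, index=(pid,), shape=(tile_size,))   # load one tile of a
    b_tile = ct.load(b, index=(pid,), shape=(tile_size,))   # load one tile of b
    result = a_tile * b_tile                                 # per-tile elementwise math
    ct.store(c, index=(pid,), tile=result)                   # write the tile back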
Putting it all together
Now we'll show how to call this vector_add kernel in Python using a complete Python script you can try yourself. The following is the complete code, including the kernel and the main function.
"""
Example demonstrating easy vector addition.
Shows the right way to perform elementwise operations on vectors.
"""
from math import ceil
import cupy as cp
import numpy as np
import cuda.tile as ct
@ct.kernel
def vector_add(a, b, c, tile_size: ct.Constant[int]):
# Get the 1D pid
pid = ct.bid(0)
# Load input tiles
a_tile = ct.load(a, index=(pid,) , shape=(tile_size, ) )
b_tile = ct.load(b, index=(pid,) , shape=(tile_size, ) )
# Perform elementwise addition
result = a_tile + b_tile
# Store result
ct.store(c, index=(pid, ), tile=result)
def test():
# Create input data
vector_size = 2**12
tile_size = 2**4
grid = (ceil(vector_size / tile_size),1,1)
a = cp.random.uniform(-1, 1, vector_size)
b = cp.random.uniform(-1, 1, vector_size)
c = cp.zeros_like(a)
# Launch kernel
ct.launch(cp.cuda.get_current_stream(),
grid, # 1D grid of processors
vector_add,
(a, b, c, tile_size))
# Copy to host only to check
a_np = cp.asnumpy(a)
b_np = cp.asnumpy(b)
c_np = cp.asnumpy(c)
# Confirm results
expected = a_np + b_np
np.testing.assert_array_almost_equal(c_np, expected)
print("✓ vector_add_example passed!")
if __name__ == "__main__":
test()
Assuming you've already installed all of the requisite software, including cuTile Python and CuPy, running this code is as simple as invoking Python.
$ python3 VectorAdd_quickstart.py
✓ vector_add_example passed!
Congratulations, you just ran your first cuTile Python program!
cuTile kernels can be profiled with NVIDIA Nsight Compute in the same way as SIMT kernels.
$ ncu -o VecAddProfile --set detailed python3 VectorAdd_quickstart.py
Once you've created the profile and opened it with the graphical version of Nsight Compute:
- Select the vector_add kernel.
- Select the “Details” tab.
- Expand the “Tile Statistics” report section.
You should see an image similar to Figure 1.
[Figure 1: The Tile Statistics report section on the Nsight Compute Details page]
Notice that the Tile Statistics report section includes the number of tile blocks specified, the block size (chosen by the compiler), and various other tile-specific information.
The source page also supports cuTile kernels and performance metrics at the source-line level, just like CUDA C kernels.
How developers can get cuTile
To run cuTile Python programs, you need the following:
- A GPU with compute capability 10.x or 12.x (in future CUDA releases, we'll add support for additional GPU architectures)
- NVIDIA Driver R580 or later (R590 is required for tile-specific developer tools support)
- CUDA Toolkit 13.1 or later
- Python version 3.10 or higher
- The cuTile Python package:
pip install cuda-tile
Get started
Check out a few videos to help you learn more:
Also, check out the cuTile Python documentation.
You're now ready to try the sample programs on GitHub and start programming in cuTile Python.
