This blog post is part of a series designed to help developers learn NVIDIA CUDA Tile programming for building high-performance GPU kernels, using matrix multiplication as a core example.
In this post, you'll learn:
- How to implement high-performance matrix multiplication using NVIDIA cuTile: Understand the flow of tile loading, computation, and storage.
- The block-level parallel programming mindset: Shift from thread-level thinking to block-level thinking.
- Best practices for Tile programming: Learn performance optimization strategies from the code.
Before you start, make sure your environment meets the following requirements (see the quickstart for more information):
Environment requirements:
- CUDA 13.1 or higher
- GPU architecture NVIDIA Blackwell (e.g., NVIDIA RTX 50 series)
- Python: 3.10 or higher
Install cuTile Python:
Note: cuTile is NVIDIA's next-generation GPU programming framework. While it currently only supports optimization for the Blackwell architecture (compute capabilities 10.x and 12.x), support for more architectures will be provided in upcoming releases of the CUDA Toolkit.
What’s matrix multiplication?
Matrix multiplication is a fundamental operation in modern technical computing. It’s the operation that’s the idea for solving systems of equations. It underpins graphics, simulations, optimization, and most of machine learning, and it maps well to high-performance hardware like GPUs. Â
Given input matrices A (M×K) and B (K×N), each element of the result matrix C (M×N) is computed as:
C[i][j] = Σ_{k=0…K-1} A[i][k] × B[k][j]
From the formula, you can see that an element of C is computed by taking the dot product of a row of A and a column of B.
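As a point of reference, the definition translates directly into a triple loop. The plain-Python function below (not part of the cuTile example) computes C one element at a time:

```python
def matmul_reference(A, B):
    # A: M x K, B: K x N, result C: M x N (lists of lists of numbers)
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                # Dot product of row i of A and column j of B
                C[i][j] += A[i][k] * B[k][j]
    return C
```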
Tile programming can simplify the implementation while achieving excellent performance by dividing the output matrix into multiple tiles. Each block is responsible for computing one output tile, and cuTile automatically handles memory access and thread synchronization. Specifically:
- Each block processes a (tm × tn) tile of the output matrix C.
- Loop over the K dimension, loading the corresponding tiles of A and B one by one.
- Use ct.mma() to perform the matrix multiply-accumulate (automatically invoking Tensor Cores).
- Finally, store the accumulated results back to global memory.
Figure 1 shows the calculation process, which is like an element-by-element algorithm, but in this case, tiles take the place of individual elements.
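To make the block-level mindset concrete before looking at GPU code, here is a small NumPy sketch (not part of the cuTile example) of the same idea: each iteration of the two outer loops plays the role of one block, and the inner loop accumulates over tiles along the K dimension. The matrix and tile sizes are arbitrary and chosen to divide evenly.

```python
import numpy as np

def tiled_matmul(A, B, tm=4, tn=4, tk=4):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for bi in range(M // tm):                            # one "block" per output tile ...
        for bj in range(N // tn):
            acc = np.zeros((tm, tn), dtype=A.dtype)      # per-tile accumulator
            for bk in range(K // tk):                    # ... loops over tiles along K
                a = A[bi*tm:(bi+1)*tm, bk*tk:(bk+1)*tk]
                b = B[bk*tk:(bk+1)*tk, bj*tn:(bj+1)*tn]
                acc += a @ b                             # tile-level multiply-accumulate
            C[bi*tm:(bi+1)*tm, bj*tn:(bj+1)*tn] = acc
    return C

A = np.random.rand(16, 12)
B = np.random.rand(12, 8)
assert np.allclose(tiled_matmul(A, B), A @ B)
```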
GPU kernel implementation
Having described the core idea, let's take a look at the complete implementation. The code is split into two parts: the kernel running on the GPU and the launch code on the CPU, as shown in the code that follows.
import cuda.tile as ct
from math import ceil
import torch

# Type alias for compile-time constants
ConstInt = ct.Constant[int]

# Step 1: Define the kernel
@ct.kernel
def matmul_kernel(A, B, C, tm: ConstInt, tn: ConstInt, tk: ConstInt):
    # 1.1 Get the block ID and map it to an output tile position.
    #     Inside swizzle_2d, we read ct.bid(0) and return bidx and bidy.
    #     M, N, and GROUP_SIZE_M are assumed to be available to the kernel
    #     (see the full example in the TileGym repository).
    bidx, bidy = swizzle_2d(M, N, tm, tn, GROUP_SIZE_M)
    # 1.2 Calculate the number of tiles along the K dimension
    num_tiles_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
    # 1.3 Initialize the accumulator
    accumulator = ct.full((tm, tn), 0, dtype=ct.float32)
    # 1.4 Loop over the K dimension
    for k in range(num_tiles_k):
        # Load tiles from A and B
        a = ct.load(A, index=(bidx, k), shape=(tm, tk))
        b = ct.load(B, index=(k, bidy), shape=(tk, tn))
        # Matrix multiply-accumulate
        accumulator = ct.mma(a, b, accumulator)
    # 1.5 Store the result
    ct.store(C, index=(bidx, bidy), tile=accumulator)

# Step 2: Launch the kernel
def cutile_matmul(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Select tile sizes
    tm, tn, tk = 128, 256, 64  # for float16
    # Get matrix dimensions
    m, k = A.shape
    _, n = B.shape
    # Calculate grid dimensions
    grid_x = ceil(m / tm)
    grid_y = ceil(n / tn)
    grid = (grid_x * grid_y, 1, 1)
    # Create output and launch
    C = torch.empty((m, n), device=A.device, dtype=A.dtype)
    ct.launch(torch.cuda.current_stream(), grid, matmul_kernel, (A, B, C, tm, tn, tk))
    return C
Now, let’s break down each key part step-by-step.
1. Define the GPU kernel
In cuTile, the @ct.kernel decorator is used to mark a normal Python function as a GPU kernel:
@ct.kernel
def matmul_kernel(A, B, C, tm: ConstInt, tn: ConstInt, tk: ConstInt):
    # Kernel code here
This decorator indicates that:
- This function will execute on the GPU.
- Each block will run an independent instance of this function.
- It can't be called directly and must be launched using ct.launch().
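For a sense of the pattern outside of matmul, here is a minimal sketch of a tile-copy kernel, assuming the cuTile calls behave as shown elsewhere in this post (the kernel and helper names here are made up for illustration):

```python
import cuda.tile as ct
import torch
from math import ceil

@ct.kernel
def copy_kernel(X, Y, t: ct.Constant[int]):
    bid = ct.bid(0)                                   # 1D block ID
    tile = ct.load(X, index=(bid, 0), shape=(t, t))   # load one (t, t) tile
    ct.store(Y, index=(bid, 0), tile=tile)            # write it back out

def copy_tiles(X: torch.Tensor, t: int = 32) -> torch.Tensor:
    # Assumes X has shape (num_tiles * t, t), so each block copies one tile
    Y = torch.empty_like(X)
    grid = (ceil(X.shape[0] / t), 1, 1)
    # Calling copy_kernel(...) directly would fail; it must go through ct.launch()
    ct.launch(torch.cuda.current_stream(), grid, copy_kernel, (X, Y, t))
    return Y
```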
2. Compile-time optimization: Constant type annotation
Notice that the parameters tm, tn, and tk use a special type annotation ct.Constant[int]:
ConstInt = ct.Constant[int]  # Define type alias

def matmul_kernel(A, B, C,
                  tm: ConstInt,   # Tile size along the M dimension
                  tn: ConstInt,   # Tile size along the N dimension
                  tk: ConstInt):  # Tile size along the K dimension
This means they’re compile-time constants. cuTile will generate specialized machine code for various tile size values, allowing the compiler to:
- Perform loop unrolling.
- Optimize memory access patterns.
- Generate optimal Tensor Core instructions.
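As a rough illustration (assuming A, B, C, m, and n are already set up as in the listing above, and using arbitrary tile values), launching the kernel with two different tile-size tuples produces two separately specialized variants, because the constants are baked in at compile time:

```python
from math import ceil

# Hypothetical example: each distinct (tm, tn, tk) tuple yields its own specialization
for tm, tn, tk in [(128, 256, 64), (64, 64, 32)]:
    grid = (ceil(m / tm) * ceil(n / tn), 1, 1)
    ct.launch(torch.cuda.current_stream(), grid, matmul_kernel, (A, B, C, tm, tn, tk))
```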
3. Determining work scope: Block ID mapping
Each block computes a particular tile of the output matrix. Through the swizzle_2d() function, we obtain the tile coordinates that the current block should process:
def swizzle_2d(M, N, tm, tn, GROUP_SIZE_M):
    # Get the global ID of the current CUDA block (CTA) in the 1D grid.
    bid = ct.bid(0)
    return swizzle_2d_from_bid(M, N, tm, tn, GROUP_SIZE_M, bid)

bidx, bidy = swizzle_2d(M, N, tm, tn, GROUP_SIZE_M)
This code determines which tile of the output matrix the current block should process. To understand it, let's start with the grid division on the host side.
Step 1: Host-side grid division
When launching the kernel on the host side (covered in detail in the host-side launch section later in this post), calculate how many blocks are needed:
grid_x = ceil(m / tm)        # Number of blocks needed for the M dimension
grid_y = ceil(n / tn)        # Number of blocks needed for the N dimension
grid_size = grid_x * grid_y  # Total number of blocks
grid = (grid_size, 1, 1)     # Defined as a 1D grid
- m and n: Rows and columns of the output matrix C.
- tm: Output tile size in the row direction (M dimension) processed by each block.
- tn: Output tile size in the column direction (N dimension) processed by each block.
Logically, launch grid_x * grid_y blocks and flatten them into a 1D grid: grid = (grid_size, 1, 1).
Step 2: Getting block ID in kernel
Inside the kernel, each block gets its unique identifier via ct.bid(0):
bid = ct.bid(0) # Return value range: [0, grid_size-1]
- ct.bid(0) queries the current block's ID along the x-axis dimension.
- The parameter 0 refers to the first dimension (x-axis), corresponding to the first element in the grid definition (grid_size, 1, 1).
- Each block gets a unique 1D coordinate: bid = 0, 1, 2, …, grid_size-1.
Step 3: Mapping 1D block ID to 2D tile coordinates
The issue now’s that the block ID (bid) is 1D, however the output matrix is 2D. We’d like to know which row and column tile this Block should process. The swizzle_2d_from_bid() function determines which row and column tile the block is chargeable for processing.
bidx, bidy = swizzle_2d_from_bid(M, N, tm, tn, GROUP_SIZE_M, bid)
Output result:
- bidx: The row index (M dimension) of the output tile the current block is responsible for. Range: [0, grid_x-1].
- bidy: The column index (N dimension) of the output tile the current block is responsible for. Range: [0, grid_y-1].
The exact mapping logic involves swizzling (used to improve memory access efficiency), which we'll explain in detail in the Performance optimization: Swizzle section. For now, just understand that it converts a 1D block ID into 2D tile coordinates.
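For contrast, a naive (non-swizzled) mapping would simply lay the 1D block IDs out in row-major order over the 2D tile grid. The helper below is hypothetical and only shown for comparison:

```python
def naive_2d_from_bid(grid_y, bid):
    # Row-major mapping: no grouping, no swizzling
    bidx = bid // grid_y   # row index of the output tile
    bidy = bid % grid_y    # column index of the output tile
    return bidx, bidy
```

With this naive mapping, consecutive block IDs sweep across an entire row of output tiles, so blocks running at the same time touch many different column tiles of B; the swizzled mapping instead keeps them in small 2D clusters.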
5. Preparing the accumulator: Initializing output tile
Before looping over the K dimension, you need to create an accumulator to store intermediate results:
num_tiles_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
accumulator = ct.full((tm, tn), 0, dtype=ct.float32)
- num_tiles_k: Calculates how many tiles need to be processed along the K dimension.
- accumulator: A zero matrix of shape (tm, tn) used to accumulate results.
- Using float32 ensures numerical precision and avoids accumulation errors.
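The last point is easy to demonstrate. The small PyTorch experiment below (not part of the kernel, just an illustration) accumulates the same dot product in float16 and in float32; the float16 accumulator typically drifts noticeably further from a high-precision reference:

```python
import torch

torch.manual_seed(0)
K = 4096
a = torch.randn(K).half()
b = torch.randn(K).half()

ref = torch.dot(a.double(), b.double())    # high-precision reference

acc16 = torch.zeros((), dtype=torch.float16)
acc32 = torch.zeros((), dtype=torch.float32)
for x, y in zip(a, b):
    acc16 = acc16 + x * y                  # accumulate in float16
    acc32 = acc32 + (x * y).float()        # accumulate in float32

print(abs(acc16.double() - ref).item())    # typically a noticeably larger error
print(abs(acc32.double() - ref).item())    # typically much closer to the reference
```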
6. Core computation loop: Traversing the K dimension
This is the core of the matrix multiplication. Loop over every tile along the K dimension and accumulate the results:
for k in range(num_tiles_k):
    # Load tiles
    a = ct.load(A, index=(bidx, k), shape=(tm, tk), padding_mode=zero_pad)
    b = ct.load(B, index=(k, bidy), shape=(tk, tn), padding_mode=zero_pad)
    # Accumulate
    accumulator = ct.mma(a, b, accumulator)
Loading data:
- ct.load(A, index=(bidx, k), shape=(tm, tk)): Loads a tile from matrix A.
- index=(bidx, k): Specifies the coordinates of the tile to load, in tile space.
- shape=(tm, tk): The size of the tile.
- padding_mode=zero_pad: Fills with zeros if the loaded data is out of bounds.
Matrix multiply-accumulate:
- ct.mma(a, b, accumulator): Multiplies a * b, adds the product to accumulator, and stores the result in accumulator (mma stands for matrix multiply-accumulate).
- When the shapes of a and b meet Tensor Core requirements, cuTile automatically invokes the GPU's Tensor Cores to accelerate this operation.
After the loop ends, the accumulator stores the whole result for the output tile.
7. Writing back results: Storing to global memory
Finally, write the calculated result back to global memory:
accumulator = ct.astype(accumulator, C.dtype)
ct.store(C, index=(bidx, bidy), tile=accumulator)
- First, convert the float32 accumulator to the output matrix data type.
- Use ct.store() to write the tile back to the corresponding position in global memory.
Launching the kernel: Host-side code
Now launch the kernel from the host. First, take a look at the complete code.
def cutile_matmul(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Determine tile sizes based on dtype
    if A.dtype.itemsize == 2:  # float16/bfloat16
        tm, tn, tk = 128, 256, 64
    else:                      # float32
        tm, tn, tk = 32, 32, 32
    m, k = A.shape
    _, n = B.shape
    # Calculate grid dimensions
    grid_x = ceil(m / tm)
    grid_y = ceil(n / tn)
    grid_size = grid_x * grid_y
    grid = (grid_size, 1, 1)
    # Create output tensor
    C = torch.empty((m, n), device=A.device, dtype=A.dtype)
    # Launch kernel
    ct.launch(torch.cuda.current_stream(), grid, matmul_kernel,
              (A, B, C, tm, tn, tk))
    return C
Launching the kernel on the host side requires three key steps:
Step 1: Calculate grid size
Based on the input matrix dimensions and tile size, calculate how many blocks are needed:
m, k = A.shape  # Matrix A dimensions: m rows, k columns
_, n = B.shape  # Matrix B dimensions: k rows, n columns

# Calculate the number of blocks needed
grid_x = ceil(m / tm)        # How many tiles are needed along the M dimension
grid_y = ceil(n / tn)        # How many tiles are needed along the N dimension
grid_size = grid_x * grid_y  # Total number of blocks
grid = (grid_size, 1, 1)     # Defined as a 1D grid
- ceil() rounds up to ensure all elements are covered (even when the matrix dimensions aren't divisible by the tile size).
- Flattening the 2D block layout into a 1D grid simplifies the launch logic.
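For example, with hypothetical sizes m = 1000 and n = 2000 and the float16 tile sizes from Step 2:

```python
from math import ceil

m, n = 1000, 2000      # example output size (not from the benchmark)
tm, tn = 128, 256      # float16 tile sizes from Step 2

grid_x = ceil(m / tm)  # ceil(1000 / 128) = 8; the last row of tiles is only partially filled
grid_y = ceil(n / tn)  # ceil(2000 / 256) = 8
grid = (grid_x * grid_y, 1, 1)  # (64, 1, 1)
```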
Step 2: Set tile size (compile-time constants)
Select appropriate tile dimensions based on data type:
if A.dtype.itemsize == 2:  # float16/bfloat16 (2 bytes per element)
    tm, tn, tk = 128, 256, 64
else:                      # float32 (4 bytes per element)
    tm, tn, tk = 32, 32, 32
These parameters are passed to the kernel as compile-time constants:
- tm: Output tile rows (M dimension).
- tn: Output tile columns (N dimension).
- tk: Size of the tile loaded from the K dimension on each iteration.
Note: The tile size configuration here is just an example. In practice, different GPU architectures require different parameter configurations to achieve optimal performance. The best configuration depends on the M/N/K sizes, GPU architecture, shared memory size, register count, SM count, and so on. During development, it is strongly recommended to use performance analysis tools (like NVIDIA Nsight Compute) to find optimal parameters. TileGym provides an autotuner to obtain optimal parameters automatically.
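As a rough illustration of such tuning (this is not the TileGym autotuner, and the candidate tile sizes below are arbitrary), you could time a few configurations with CUDA events and keep the fastest:

```python
from math import ceil
import torch
import cuda.tile as ct

def time_config(A, B, tm, tn, tk, iters=10):
    # Time one candidate (tm, tn, tk) configuration for matmul_kernel defined above
    m, _ = A.shape
    _, n = B.shape
    C = torch.empty((m, n), device=A.device, dtype=A.dtype)
    grid = (ceil(m / tm) * ceil(n / tn), 1, 1)
    ct.launch(torch.cuda.current_stream(), grid, matmul_kernel, (A, B, C, tm, tn, tk))  # warm-up / compile
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        ct.launch(torch.cuda.current_stream(), grid, matmul_kernel, (A, B, C, tm, tn, tk))
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per launch

A = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
B = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
candidates = [(64, 128, 64), (128, 128, 64), (128, 256, 64)]  # arbitrary example values
best = min(candidates, key=lambda cfg: time_config(A, B, *cfg))
print("fastest (tm, tn, tk):", best)
```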
Step 3: Call ct.launch() to start the kernel
C = torch.empty((m, n), device=A.device, dtype=A.dtype)  # Create output tensor
ct.launch(
    torch.cuda.current_stream(),  # CUDA stream
    grid,                         # Grid dimensions: (grid_size, 1, 1)
    matmul_kernel,                # Kernel function
    (A, B, C, tm, tn, tk)         # Arguments passed to the kernel
)
- Stream: Specifies which CUDA stream the kernel executes on (for asynchronous execution and multi-stream concurrency).
- Grid: Defines how many blocks to launch.
- Kernel function: The GPU kernel to execute (function decorated with @ct.kernel).
- Argument tuple: All parameters passed to the kernel; tm, tn, and tk will be recognized by the compiler as constants.
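As a quick sanity check after launching, you can compare the result against PyTorch (cuBLAS); this usage sketch assumes the cutile_matmul function defined above:

```python
import torch

A = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
B = torch.randn(512, 2048, device="cuda", dtype=torch.float16)

C = cutile_matmul(A, B)   # the function defined above
C_ref = A @ B             # PyTorch / cuBLAS reference

# float16 results won't match bit-for-bit, so compare with a loose tolerance
print(torch.allclose(C, C_ref, rtol=1e-2, atol=1e-2))
```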
Performance optimization: Swizzle
Earlier, swizzling was introduced to improve performance. The code for swizzle_2d_from_bid is shown below.
def swizzle_2d_from_bid(M, N, tm, tn, GROUP_SIZE_M, bid):
    # Map the 1D global block ID to 2D tile coordinates, with group-wise swizzling.
    num_bid_m = ct.cdiv(M, tm)
    num_bid_n = ct.cdiv(N, tn)
    num_bid_in_group = GROUP_SIZE_M * num_bid_n
    group_id = bid // num_bid_in_group
    first_bid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_bid_m - first_bid_m, GROUP_SIZE_M)
    bid_m = first_bid_m + (bid % group_size_m)
    bid_n = (bid % num_bid_in_group) // group_size_m
    return bid_m, bid_n
How does swizzle improve performance?
It remaps block IDs to tile indices through grouping and interleaving so that the cache is used more efficiently.
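To see the remapping concretely, here is a pure-Python version of swizzle_2d_from_bid (with ct.cdiv replaced by math.ceil), evaluated on a hypothetical 4×4 grid of tiles with GROUP_SIZE_M = 2:

```python
from math import ceil

def swizzle_demo(M, N, tm, tn, group_size_m, bid):
    # Pure-Python equivalent of swizzle_2d_from_bid, for illustration only
    num_bid_m = ceil(M / tm)
    num_bid_n = ceil(N / tn)
    num_bid_in_group = group_size_m * num_bid_n
    group_id = bid // num_bid_in_group
    first_bid_m = group_id * group_size_m
    group_size = min(num_bid_m - first_bid_m, group_size_m)
    bid_m = first_bid_m + (bid % group_size)
    bid_n = (bid % num_bid_in_group) // group_size
    return bid_m, bid_n

# 4x4 grid of tiles (M = N = 512, tm = tn = 128), GROUP_SIZE_M = 2
print([swizzle_demo(512, 512, 128, 128, 2, bid) for bid in range(8)])
# -> [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
```

Consecutive block IDs now stay inside a tall two-row group instead of sweeping across a full row of tiles.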
Using four elements (shaded areas) of the output matrix as an example, the figure compares linear versus swizzled memory access.
Method 1: Linear row access
- Computes one row of the result matrix (e.g., 4 elements).
- Must read 4 blocks from the left matrix + all 16 blocks from the right matrix.
- Total memory access: 20 data blocks.
- Data from the right matrix is loaded repeatedly and evicted quickly, resulting in a low cache hit rate.
Method 2: Swizzling / tiled block access
- Reorganizes computation into 2×2 local blocks.
- Only needs to read eight relevant blocks from the left matrix + eight relevant blocks from the right matrix.
- Total memory access: 16 data blocks (20% reduction).
- Better data locality results in a higher cache hit rate.
Performance benchmarks
To verify the performance of the implemented matrix multiplication kernel, it was tested on an NVIDIA GeForce RTX 5080 (compute capability 12.0). You can find the complete benchmark code in the TileGym repository. Make sure to follow the installation instructions, and then you can run this and the other tests following the Quick Start instructions.
Test configuration:
- Data Type: float16
- Matrix shape: Standard square matrix (N×N)
- Test sizes: N = 1024, 2048, 4096, 8192, 16384 (i.e., 2^10 to 2^14)
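As a rough sketch of the methodology (not the TileGym benchmark code itself), each measured kernel time can be converted to throughput using the standard 2·N³ operation count for an N×N×N matmul:

```python
def tflops(n: int, ms: float) -> float:
    # 2 * n^3 floating-point operations; ms is the measured kernel time in milliseconds
    return 2 * n**3 / (ms * 1e-3) / 1e12

# e.g., a 4096 x 4096 x 4096 matmul that takes 1.0 ms sustains ~137 TFLOP/s
print(tflops(4096, 1.0))
```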
The following figure shows the performance for different matrix sizes.
The results show that:
- At large matrix scales, the cuTile implementation can fully utilize the GPU’s computing power.
- With an appropriate tile size configuration and the swizzle optimization, the cuTile implementation achieves over 90% of the performance of state-of-the-art implementations (PyTorch calling cuBLAS).
Summary
This classic matrix multiplication example shows the complete process of implementing a GPU kernel using cuTile. Although matrix multiplication is simple, it contains the core ideas of Tile programming. Mastering these concepts will enable you to implement a wide range of high-performance GPU kernels with cuTile. Try the full matrix multiply example and more in the TileGym repo, and start writing high-performance tile code today.
