How to Write High-Performance Matrix Multiply in NVIDIA CUDA Tile



This blog post is part of a series designed to help developers learn NVIDIA CUDA Tile programming for building high-performance GPU kernels, using matrix multiplication as a core example.

In this post, you’ll learn:

  • How to implement high-performance matrix multiplication using NVIDIA cuTile: Understand the flow of tile loading, computation, and storage.
  • The block-level parallel programming mindset: Shift from thread-level thinking to block-level thinking.
  • Best practices for Tile programming: Learn performance optimization strategies from the code.

Before you start, make sure your environment meets the following requirements (see the quickstart for more information):

Environment requirements:

  • CUDA 13.1 or higher
  • GPU architecture: NVIDIA Blackwell (e.g., NVIDIA RTX 50 series)
  • Python: 3.10 or higher

Install cuTile Python:

Note: cuTile is NVIDIA’s next-generation GPU programming framework. While it currently only supports optimization for the Blackwell architecture (compute capabilities 10.x and 12.x), support for more architectures will be provided in upcoming releases of the CUDA Toolkit.

What is matrix multiplication?

Matrix multiplication is a fundamental operation in modern technical computing. It is the basis for solving systems of equations. It underpins graphics, simulations, optimization, and most of machine learning, and it maps well to high-performance hardware like GPUs.

Given input matrices A (MxK) and B (KxN), the formula for calculating an element of the result matrix C (MxN) is as follows.

C_{ij} = \sum_{k=1}^{K} A_{ik} B_{kj}

From the formula, you can see that an element of C is computed by taking the dot product of a row of A and a column of B.
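
As a concrete reference, the formula translates directly into the following NumPy code (purely illustrative; the GPU kernel later in this post does not work element by element):

import numpy as np

def matmul_reference(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            # C[i, j] is the dot product of row i of A and column j of B
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C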

Tile programming can simplify the implementation while achieving excellent performance by dividing the output matrix into multiple tiles. Each block is responsible for computing one output tile, and cuTile automatically handles memory access and thread synchronization. Specifically (a CPU-side sketch of this decomposition follows the list):

  1. Each block processes a (tm × tn) tile of the output matrix C.
  2. Loop over the K dimension, loading corresponding tiles of A and B one at a time.
  3. Use ct.mma() to perform matrix multiply-accumulate (automatically invoking Tensor Cores).
  4. Finally, store the accumulated results back to global memory.
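
Here is a CPU-side NumPy sketch of those four steps (a hypothetical helper for illustration only, not cuTile code; it assumes M, N, and K divide evenly by the tile sizes). Each iteration of the two outer loops plays the role of one block:

import numpy as np

def matmul_tiled_reference(A, B, tm=4, tn=4, tk=4):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tm):          # one (i, j) pair == one block / one output tile
        for j in range(0, N, tn):
            accumulator = np.zeros((tm, tn), dtype=np.float32)
            for k in range(0, K, tk):  # loop over the K dimension, tile by tile
                a = A[i:i + tm, k:k + tk]
                b = B[k:k + tk, j:j + tn]
                accumulator += a @ b   # the role ct.mma() plays on the GPU
            C[i:i + tm, j:j + tn] = accumulator  # the role ct.store() plays
    return C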

Figure 1 shows the calculation process, which is like an element-by-element algorithm, but in this case, the tiles take the place of individual elements.

 Image of tile-based matrix multiply showing how a tile of A is multiplied by a tile of B to generate a portion of the tile of C.
Figure 1. Illustration of matrix multiply (A * B = C) broken into tiles

GPU kernel implementation

Having described the core idea, let’s take a look at the complete implementation code. The code is split into two parts: the kernel running on the GPU and the launch code on the CPU, as shown in the code that follows.

import cuda.tile as ct
from math import ceil
import torch

# Type alias for compile-time constants
ConstInt = ct.Constant[int]

# Step 1: Define the kernel
@ct.kernel
def matmul_kernel(A, B, C, tm: ConstInt, tn: ConstInt, tk: ConstInt):
    # 1.1 Get block ID and map to output tile position
    # (inside swizzle_2d, we read ct.bid(0) and return bidx and bidy;
    #  M, N, and GROUP_SIZE_M are assumed to be defined elsewhere, e.g., as module-level constants)
    bidx, bidy = swizzle_2d(M, N, tm, tn, GROUP_SIZE_M)

    # 1.2 Calculate the variety of tiles along the K dimension
    num_tiles_k = ct.num_tiles(A, axis=1, shape=(tm, tk))

    # 1.3 Initialize accumulator
    accumulator = ct.full((tm, tn), 0, dtype=ct.float32)

    # 1.4 Loop over K dimension
    for k in range(num_tiles_k):
        # Load tiles from A and B
        a = ct.load(A, index=(bidx, k), shape=(tm, tk))
        b = ct.load(B, index=(k, bidy), shape=(tk, tn))

        # Matrix multiply-accumulate
        accumulator = ct.mma(a, b, accumulator)

    # 1.5 Convert to the output dtype and store the result
    accumulator = ct.astype(accumulator, C.dtype)
    ct.store(C, index=(bidx, bidy), tile=accumulator)

# Step 2: Launch the kernel
def cutile_matmul(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Select tile sizes
    tm, tn, tk = 128, 256, 64  # for float16

    # Get matrix dimensions
    m, k = A.shape
    _, n = B.shape

    # Calculate grid dimensions
    grid_x = ceil(m / tm)
    grid_y = ceil(n / tn)
    grid = (grid_x * grid_y, 1, 1)

    # Create output and launch
    C = torch.empty((m, n), device=A.device, dtype=A.dtype)
    ct.launch(torch.cuda.current_stream(), grid, matmul_kernel, (A, B, C, tm, tn, tk))
    return C

Now, let’s break down each key part step by step.

1. Define the GPU kernel

In cuTile, the @ct.kernel decorator is used to mark a standard Python function as a GPU kernel:

@ct.kernel
def matmul_kernel(A, B, C, tm: ConstInt, tn: ConstInt, tk: ConstInt):
    # Kernel code here

This decorator indicates that:

  • This function will execute on the GPU.
  • Each block will run an independent instance of this function.
  • It can’t be called directly and must be launched using ct.launch().

2. Compile-time optimization: Constant type annotation

Notice that the parameters tm, tn, and tk use a special type annotation ct.Constant[int]:

ConstInt = ct.Constant[int]  # Define type alias

def matmul_kernel(A, B, C,
                  tm: ConstInt,  # Tile size along M dimension
                  tn: ConstInt,  # Tile size along N dimension
                  tk: ConstInt): # Tile size along K dimension

This indicates that they are compile-time constants. cuTile generates specialized machine code for each distinct set of tile size values, allowing the compiler to:

  • Perform loop unrolling.
  • Optimize memory access patterns.
  • Generate optimal Tensor Core instructions.

3. Determining work scope: Block ID mapping

Each block computes a specific tile of the output matrix. Through the swizzle_2d() function, we obtain the tile coordinates that the current block should process:

def swizzle_2d(M, N, tm, tn, GROUP_SIZE_M):
    # Get the global ID of the current CUDA block (CTA) in a 1D grid.
    bid = ct.bid(0)
    return swizzle_2d_from_bid(M, N, tm, tn, GROUP_SIZE_M, bid)

bidx, bidy = swizzle_2d(M, N, tm, tn, GROUP_SIZE_M)

The purpose of this code is to determine which tile of the output matrix the current block should process. To understand this process, let’s start with the grid division on the host side.

Step 1: Host-side grid division

When launching the kernel on the host side (covered in more detail in the host-side code section below), calculate how many blocks are needed:

grid_x = ceil(m / tm)  # Number of blocks needed for the M dimension
grid_y = ceil(n / tn)  # Number of blocks needed for the N dimension
grid_size = grid_x * grid_y  # Total number of blocks
grid = (grid_size, 1, 1)  # Defined as a 1D grid
  • m and n: Rows and columns of output matrix C.
  • tm: Output tile size in row direction (M dimension) processed by each block.
  • tn: Output tile size in column direction (N dimension) processed by each block.

Logically, launch grid_x * grid_y blocks and flatten them into a 1D grid: grid = (grid_size, 1, 1).

Step 2: Getting block ID in kernel

Inside the kernel, each block gets its unique identifier via ct.bid(0):

bid = ct.bid(0)  # Return value range: [0, grid_size-1]
  • ct.bid(0) queries the current block’s ID in the x-axis dimension.
  • Parameter 0 represents the first dimension (x-axis), corresponding to the first element in the grid definition (grid_size, 1, 1).
  • Each block gets a unique 1D coordinate: bid = 0, 1, 2, …, grid_size-1 (see the naive mapping sketch after this list for intuition).
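
For intuition, the simplest possible mapping from this 1D bid to 2D tile coordinates would be a plain row-major split, sketched below (this is not what swizzle_2d does; the swizzled version is explained later):

def linear_2d_from_bid(bid: int, grid_y: int) -> tuple[int, int]:
    # Naive row-major mapping, shown only for contrast with the swizzled mapping
    bidx = bid // grid_y   # which tile row of C
    bidy = bid % grid_y    # which tile column of C
    return bidx, bidy

# Example: a 2 x 4 grid of output tiles launched as 8 blocks
print([linear_2d_from_bid(b, grid_y=4) for b in range(8)])
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)]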

Step 3: Mapping 1D block ID to 2D tile coordinates

The issue now is that the block ID (bid) is 1D, but the output matrix is 2D. We need to know which row and column tile this block should process. The swizzle_2d_from_bid() function determines which row and column tile the block is responsible for processing.

bidx, bidy = swizzle_2d_from_bid(M, N, tm, tn, GROUP_SIZE_M, bid)

Output result:

  • bidx: The row index (M dimension) of the output tile the current block is responsible for. Range: [0, grid_x-1].
  • bidy: The column index (N dimension) of the output tile the current block is responsible for. Range: [0, grid_y-1].

The precise mapping logic involves swizzling (used to improve memory access efficiency), which is explained in detail in the performance optimization section later in this post. For now, just understand that it converts a 1D block ID into 2D tile coordinates.

4. Preparing the accumulator: Initializing the output tile

Before looping through the K dimension, you need to create an accumulator to store intermediate results:

num_tiles_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
accumulator = ct.full((tm, tn), 0, dtype=ct.float32)
  • num_tiles_k: Calculates how many tiles need to be processed in the K dimension (see the worked example after this list).
  • accumulator: A zero matrix of shape (tm, tn) used to accumulate results.
  • Using float32 ensures numerical precision and avoids accumulation errors.
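
A quick worked example (illustrative numbers, not part of the kernel): with K = 1000 and tk = 64, the loop below runs 16 times and the last iteration covers only a partial tile.

from math import ceil

K, tk = 1000, 64
num_tiles_k = ceil(K / tk)   # what ct.num_tiles computes here, conceptually
print(num_tiles_k)           # 16; the 16th tile covers only 1000 - 15 * 64 = 40 columns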

5. Core computation loop: Traversing the K dimension

This is the core of matrix multiplication. Now loop over every tile in the K dimension and accumulate the results:

for k in range(num_tiles_k):
    # Load tiles
    a = ct.load(A, index=(bidx, k), shape=(tm, tk), padding_mode=zero_pad)
    b = ct.load(B, index=(k, bidy), shape=(tk, tn), padding_mode=zero_pad)

    # Accumulate
    accumulator = ct.mma(a, b, accumulator)

Loading data:

  • ct.load(A, index=(bidx, k), shape=(tm, tk)): Loads a tile from matrix A.
  • index=(bidx, k): Specifies the coordinates of the tile to load in tile space.
  • shape=(tm, tk): The size of the tile.
  • padding_mode=zero_pad: Fills with zeros if the load data is out of bounds.

Matrix multiply-accumulate:

  • ct.mma(a, b, accumulator): Multiplies a * b, adds it to accumulator, and stores the result in accumulator (mma stands for matrix multiply-accumulate).
  • When the shapes of a and b meet Tensor Core requirements, cuTile automatically invokes the GPU’s Tensor Cores to speed up this operation.

After the loop ends, the accumulator holds the complete result for the output tile.

6. Writing back results: Storing to global memory

Finally, write the calculated result back to global memory:

accumulator = ct.astype(accumulator, C.dtype)
ct.store(C, index=(bidx, bidy), tile=accumulator)
  • First, convert the float32 accumulator to the output matrix data type.
  • Use ct.store() to write the tile back to the corresponding position in global memory (the NumPy illustration after this list shows why accumulating in float32 matters).
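
The following NumPy snippet (a standalone illustration, unrelated to the kernel code) shows why accumulating in float32 and converting only at the end is worth the extra step:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16384).astype(np.float16)
y = rng.standard_normal(16384).astype(np.float16)

acc_fp16 = np.float16(0.0)   # accumulate in the output precision (lossy)
acc_fp32 = np.float32(0.0)   # accumulate in float32, like the kernel does
for xi, yi in zip(x, y):
    acc_fp16 += xi * yi                           # rounds to float16 after every addition
    acc_fp32 += np.float32(xi) * np.float32(yi)

reference = float(np.dot(x.astype(np.float64), y.astype(np.float64)))
print("float16 accumulator error:", abs(float(acc_fp16) - reference))
print("float32 accumulator error:", abs(float(acc_fp32) - reference))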

Launching the kernel: Host-side code

Now launch the kernel from the host. First, take a look at the complete code.

def cutile_matmul(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Determine tile sizes based on dtype
    if A.dtype.itemsize == 2:  # float16/bfloat16
        tm, tn, tk = 128, 256, 64
    else:  # float32
        tm, tn, tk = 32, 32, 32

    m, k = A.shape
    _, n = B.shape

    # Calculate grid dimensions
    grid_x = ceil(m / tm)
    grid_y = ceil(n / tn)
    grid_size = grid_x * grid_y
    grid = (grid_size, 1, 1)

    # Create output tensor
    C = torch.empty((m, n), device=A.device, dtype=A.dtype)

    # Launch kernel
    ct.launch(torch.cuda.current_stream(), grid, matmul_kernel,
              (A, B, C, tm, tn, tk))

    return C

Launching the kernel on the host side requires three key steps:

Step 1: Calculate grid size

Based on the input matrix dimensions and tile size, calculate how many blocks are needed:

m, k = A.shape  # Matrix A dimensions: m rows, k columns
_, n = B.shape  # Matrix B dimensions: k rows, n columns

# Calculate number of blocks needed
grid_x = ceil(m / tm)  # How many tiles are needed for the M dimension
grid_y = ceil(n / tn)  # How many tiles are needed for the N dimension
grid_size = grid_x * grid_y  # Total number of blocks
grid = (grid_size, 1, 1)  # Defined as 1D grid
  • ceil() rounds up to ensure all elements are covered (even when the matrix dimensions aren’t evenly divisible by the tile size).
  • Flattening the 2D block layout into a 1D grid simplifies the launch logic (a worked example follows).
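
A quick worked example with illustrative numbers: for a 1000 × 500 output matrix with tm = 128 and tn = 256:

from math import ceil

m, n = 1000, 500
tm, tn = 128, 256
grid_x = ceil(m / tm)          # ceil(1000 / 128) = 8 tile rows
grid_y = ceil(n / tn)          # ceil(500 / 256)  = 2 tile columns
grid = (grid_x * grid_y, 1, 1)
print(grid)                    # (16, 1, 1): 16 blocks; the bottom and right edge tiles are partial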

Step 2: Set tile size (compile-time constants)

Select appropriate tile dimensions based on data type:

if A.dtype.itemsize == 2:  # float16/bfloat16 (2 bytes per element)
    tm, tn, tk = 128, 256, 64
else:  # float32 (4 bytes per element)
    tm, tn, tk = 32, 32, 32

These parameters are passed to the kernel as compile-time constants:

  • tm: Output tile rows (M dimension).
  • tn: Output tile columns (N dimension).
  • tk: Size of the tile loaded in each iteration along the K dimension.

Note: The tile size configuration here is an example. In practice, different GPU architectures require different parameter configurations to achieve optimal performance. The best configuration depends on the M/N/K sizes, GPU architecture, shared memory size, register count, SM count, and so on. During development, it is recommended to use performance analysis tools (like NVIDIA Nsight Compute) to find optimal parameters. TileGym provides an autotuner to automatically obtain optimal parameters.
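
If you want a quick, hand-rolled sweep before reaching for Nsight Compute or the TileGym autotuner, something like the sketch below works. This is a minimal, illustrative helper (not part of cuTile or TileGym); it assumes the matmul_kernel and ct import from the listing above are in scope, and the candidate configurations are arbitrary examples.

import torch
from math import ceil

def sweep_tile_sizes(A, B, candidates=((128, 256, 64), (128, 128, 64), (64, 128, 64))):
    m, k = A.shape
    _, n = B.shape
    stream = torch.cuda.current_stream()
    best = None
    for tm, tn, tk in candidates:
        C = torch.empty((m, n), device=A.device, dtype=A.dtype)
        grid = (ceil(m / tm) * ceil(n / tn), 1, 1)
        # First launch warms up and triggers JIT compilation of this specialization
        ct.launch(stream, grid, matmul_kernel, (A, B, C, tm, tn, tk))
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        ct.launch(stream, grid, matmul_kernel, (A, B, C, tm, tn, tk))
        end.record()
        torch.cuda.synchronize()
        ms = start.elapsed_time(end)
        print(f"tm={tm} tn={tn} tk={tk}: {ms:.3f} ms")
        if best is None or ms < best[0]:
            best = (ms, (tm, tn, tk))
    return best[1]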

Step 3: Call ct.launch() to start the kernel

C = torch.empty((m, n), device=A.device, dtype=A.dtype)  # Create output tensor

ct.launch(
    torch.cuda.current_stream(),  # CUDA stream
    grid,                          # Grid dimensions: (grid_size, 1, 1)
    matmul_kernel,                # Kernel function
    (A, B, C, tm, tn, tk)         # Arguments passed to kernel
)
  • Stream: Specifies which CUDA stream the kernel executes on (for asynchronous execution and multi-stream concurrency).
  • Grid: Defines how many blocks to launch.
  • Kernel function: The GPU kernel to execute (a function decorated with @ct.kernel).
  • Argument tuple: All parameters passed to the kernel; tm, tn, and tk will be recognized by the compiler as constants.
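
As a usage example, you can sanity-check the kernel against PyTorch’s own matmul (a small test sketch, assuming cutile_matmul from above is in scope):

import torch

m, k, n = 1024, 512, 2048
A = torch.randn(m, k, device="cuda", dtype=torch.float16)
B = torch.randn(k, n, device="cuda", dtype=torch.float16)

C_cutile = cutile_matmul(A, B)
C_torch = A @ B  # PyTorch dispatches to cuBLAS

# float16 Tensor Core results won't match bit-for-bit, so use loose tolerances
print(torch.allclose(C_cutile, C_torch, rtol=1e-2, atol=1e-2))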

Performance optimization: Swizzle

Earlier, swizzling was introduced as a way to improve performance. The code for swizzle_2d_from_bid is shown below.

def swizzle_2d_from_bid(M, N, tm, tn, GROUP_SIZE_M, bid):
    # Get the global ID of a given CUDA block in a 1D grid.
    num_bid_m = ct.cdiv(M, tm)
    num_bid_n = ct.cdiv(N, tn)
    num_bid_in_group = GROUP_SIZE_M * num_bid_n
    group_id = bid // num_bid_in_group
    first_bid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_bid_m - first_bid_m, GROUP_SIZE_M)
    bid_m = first_bid_m + (bid % group_size_m)
    bid_n = (bid % num_bid_in_group) // group_size_m
    return bid_m, bid_n

How does swizzle improve performance?

It re-maps block IDs to tile indices through grouping and interleaving so that the cache is used more efficiently.
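
To see the reordering concretely, here is a pure-Python reproduction of the mapping (with math.ceil standing in for ct.cdiv, and illustrative sizes) for a 4 × 4 grid of output tiles with GROUP_SIZE_M = 2:

from math import ceil

def swizzle_2d_demo(M, N, tm, tn, GROUP_SIZE_M, bid):
    # Same arithmetic as swizzle_2d_from_bid, but in plain Python
    num_bid_m = ceil(M / tm)
    num_bid_n = ceil(N / tn)
    num_bid_in_group = GROUP_SIZE_M * num_bid_n
    group_id = bid // num_bid_in_group
    first_bid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_bid_m - first_bid_m, GROUP_SIZE_M)
    bid_m = first_bid_m + (bid % group_size_m)
    bid_n = (bid % num_bid_in_group) // group_size_m
    return bid_m, bid_n

# M = N = 512 with 128 x 128 tiles -> a 4 x 4 grid of output tiles
print([swizzle_2d_demo(512, 512, 128, 128, 2, b) for b in range(8)])
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
# Consecutive blocks stay within a group of 2 tile rows instead of walking across a whole row,
# so they reuse the same tiles of A and B while those tiles are still resident in cache.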

Using 4 elements (shaded areas) of the output matrix as an example, the figure compares linear versus swizzled memory access.

Graphic showing how linear row access uses 4 elements of matrix A and 16 elements of matrix B, compared to tiled block access, which uses 8 elements of matrix A and 8 elements of matrix B, a 20% reduction in memory use.
Figure 2. Visual comparison of linear row access versus tiled block access

Method 1: Linear row access

  • Calculates one row of data in the result matrix (e.g., 4 elements).
  • Must read 4 blocks from the left matrix + all 16 blocks from the right matrix.
  • Total memory access: 20 data blocks.
  • Right matrix data is loaded repeatedly and evicted quickly, resulting in a low cache hit rate.

Method 2: Swizzling / tiled block access 

  • Reorganizes computation into 2×2 local blocks.
  • Only needs to read eight relevant blocks from the left matrix + eight relevant blocks from the right matrix.
  • Total memory access: 16 data blocks (20% reduction).
  • Higher data locality leads to a higher cache hit rate.

Performance benchmarks

To verify the performance of the implemented matrix multiplication kernel, it was tested on an NVIDIA GeForce RTX 5080 (compute capability 12.0). You can find the complete benchmark code in the TileGym repository. Be sure to follow the installation instructions, and then you can run this and the other tests following the Quick Start instructions.

Test configuration:

  • Data Type: float16
  • Matrix shape: Standard square matrix (N×N)
  • Test sizes: N = 1024, 2048, 4096, 8192, 16384 (i.e., 2^10 to 2^14)

The following figure shows the performance at different matrix sizes.

A figure showing cuTile and PyTorch performance data; while PyTorch performs slightly better, cuTile typically achieves 90% or more of PyTorch’s performance.
Figure 3. cuTile and PyTorch TFLOP/s performance versus matrix size on NVIDIA GeForce RTX 5080

The results show that:

  • At large matrix scales, the cuTile implementation can fully utilize the GPU’s computing power.
  • Through appropriate tile size configuration and swizzle optimization, the cuTile implementation achieves over 90% of the performance of a SOTA implementation (PyTorch calling cuBLAS); a minimal timing sketch follows this list.
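
For reference, TFLOP/s numbers like these are typically computed as 2·M·N·K floating-point operations divided by the measured runtime. A minimal timing sketch (not the TileGym benchmark harness; it assumes cutile_matmul from above is in scope) might look like this:

import torch

def measure_tflops(matmul_fn, n, dtype=torch.float16, iters=10):
    # Time matmul_fn on an n x n problem and convert to TFLOP/s (2 * n^3 FLOPs per matmul)
    A = torch.randn(n, n, device="cuda", dtype=dtype)
    B = torch.randn(n, n, device="cuda", dtype=dtype)
    matmul_fn(A, B)  # warm-up (triggers JIT compilation on the first call)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        matmul_fn(A, B)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters
    return 2 * n**3 / seconds / 1e12

print("cuTile :", measure_tflops(cutile_matmul, 4096))
print("PyTorch:", measure_tflops(torch.matmul, 4096))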

Summary 

This classic matrix multiplication example shows the complete process of implementing a GPU kernel using cuTile. Although matrix multiplication is simple, it contains the core ideas of Tile programming. Mastering these concepts will enable you to implement a wide range of high-performance GPU kernels using cuTile. Try the complete matrix multiply example and more in the TileGym repo, and start writing high-performance tile code today.


