cuTile.jl Brings NVIDIA CUDA Tile-Based Programming to Julia



NVIDIA CUDA Tile is one of the most significant additions to NVIDIA CUDA programming and unlocks automatic access to tensor cores and other specialized hardware. Earlier this year, NVIDIA released cuTile for Python, giving Python developers a natural way to write high-performance GPU kernels.

Now, the same programming model is available in Julia through cuTile.jl. In this blog post, we’ll explore how cuTile.jl simplifies the development of high-performance CUDA kernels, show its idiomatic Julia syntax, and discuss its performance parity with the existing cuTile Python implementation.

What’s tile-based GPU programming?

Traditional GPU programming with CUDA requires developers to think in terms of threads, warps, and memory hierarchies. While powerful, this approach requires the programmer to map algorithms onto hardware efficiently. With CUDA Tile, developers describe operations on tiles of data, and the compiler handles the mapping to hardware.

Consider vector addition. In the traditional GPU programming model, using CUDA.jl, the programmer must manage individual threads explicitly:

using CUDA

function vadd(a, b, c, n)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

threads = 512
blocks = cld(vector_size, threads)
@cuda threads=threads blocks=blocks vadd(a, b, c, vector_size)

With CUDA Tile through cuTile.jl, the same operations are instead expressed at the tile level, hiding details like index calculations and out-of-bounds checks:

import cuTile as ct

function vadd(a, b, c, tile_size)
    pid = ct.bid(1)
    tile_a = ct.load(a, pid, (tile_size,))
    tile_b = ct.load(b, pid, (tile_size,))
    ct.store(c, pid, tile_a + tile_b)
    return
end

tile_size = 1024
grid = cld(vector_size, tile_size)
ct.launch(vadd, grid, a, b, c, ct.Constant(tile_size))

Compare this with the Python equivalent:

@ct.kernel
def vadd(a, b, c, tile_size: ct.Constant[int]):
    pid = ct.bid(0)
    tile_a = ct.load(a, index=(pid,), shape=(tile_size,))
    tile_b = ct.load(b, index=(pid,), shape=(tile_size,))
    ct.store(c, index=(pid,), tile=tile_a + tile_b)

tile_size = 1024
grid = ceil(vector_size / tile_size)
ct.launch(stream, grid, vadd, (a, b, c, tile_size))

The two are strikingly similar, and that is deliberate. cuTile.jl keeps the abstraction level of kernels close to those written in cuTile Python, making it easy to port code over or to learn from the cuTile Python documentation. At the same time, it uses Julia idioms wherever possible to make the package intuitive for Julia programmers, including 1-based indexing and broadcast expressions for element-wise operations.

Idiomatic Julia kernels

Where this really shines is in kernels that go beyond simple loads and stores. The following is a row-normalization kernel, the core of layer normalization without the weights and bias:

function normalize_rows(X, Y, tile_n)
    bid = ct.bid(1)
    tile = ct.load(X, (bid, 1), (1, tile_n))
    mean = sum(tile; dims=2) / size(X, 2)
    centered = tile .- mean
    var = sum(centered .^ 2.0f0; dims=2) / size(X, 2)
    ct.store(Y, (bid, 1), centered ./ sqrt.(var .+ 1f-5))
    return
end

In this example, sum, size, and sqrt are standard Julia functions augmented to work on tiles. The dots (.^, .-, ./) are standard Julia broadcasting syntax, showing that the operation is applied element-wise. The kernel reads like regular Julia array code, and the closer cuTile.jl kernels are to ordinary Julia, the easier it is to share and reuse code between the CPU and GPU.
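To illustrate that code sharing, here is a sketch of the same row normalization written as plain CPU Julia over a standard Matrix. The function name is illustrative (it is not part of cuTile.jl); note how the body mirrors the kernel above, with only the tile loads and stores replaced by whole-array operations:

```julia
# CPU reference for the normalize_rows kernel: the same broadcasting
# expressions, but operating on a whole Matrix instead of per-block tiles.
function normalize_rows_cpu(X::AbstractMatrix{Float32})
    mean = sum(X; dims=2) ./ size(X, 2)          # per-row mean (n×1 matrix)
    centered = X .- mean                          # broadcast subtraction
    var = sum(centered .^ 2.0f0; dims=2) ./ size(X, 2)
    return centered ./ sqrt.(var .+ 1f-5)        # normalize each row
end

X = rand(Float32, 4, 8)
Y = normalize_rows_cpu(X)
# each row of Y now has (approximately) zero mean
```

A CPU version like this is also a convenient reference for testing the GPU kernel’s output.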

Performance of cuTile.jl

cuTile.jl targets the same NVIDIA Tile IR backend as cuTile Python, so both packages produce the same kind of GPU machine code. On an NVIDIA GeForce RTX 5080 (compute capability 12.0, NVIDIA Blackwell architecture), compute-intensive kernels achieve performance parity with the Python implementation:

Kernel                  cuTile.jl     cuTile Python   cuTile.jl vs. cuTile Python
Vector addition         838 GB/s      843 GB/s        99%
Matrix transpose        797 GB/s      812 GB/s        98%
Matrix multiplication   50.9 TFLOPS   50.5 TFLOPS     100%
Batch matrix multiply   43.0 TFLOPS   47.5 TFLOPS     91%

Table 1. Performance comparison of common GPU kernels when using Julia or Python as the front-end

Some kernels with more complex control flow, such as layer normalization or FFT, don’t reach full performance parity yet, because the cuTile.jl compiler is still maturing. These gaps are tracked as known issues and are actively being worked on.

How cuTile.jl works

cuTile.jl uses a custom Julia compiler that intercepts standard library calls such as +, sum, and reshape, and routes them to Tile IR operations. The resulting IR is then lowered to Tile IR bytecode, the same binary format that cuTile Python produces. From there, the NVIDIA tileiras compiler handles the final compilation to GPU machine code.

The generated Tile IR can be inspected for any kernel:

julia> ct.@device_code_tiled ct.launch(vadd, grid, a, b, c, ct.Constant(16))
cuda_tile.module @kernels {
  entry @vadd(%arg0: tile<...>, %arg1: tile<...>, ...) {
    ...
    return
  }
}

This transparency is useful for debugging and for understanding how high-level Julia code maps to tile operations.

Current status of cuTile.jl

cuTile.jl is an experimental, open-source package under active development at JuliaGPU/cuTile.jl. It supports a broad set of tile operations, including memory access, arithmetic, reductions, scans, matrix multiply, shape manipulation, and atomics. It also includes working examples for vector addition, matrix multiplication, transpose, batch matrix multiply, layer normalization, and FFT.

That said, this is early-stage software:

  • Not all cuTile features are implemented.
  • Some Julia language features (notably iterator-based `for` loops) aren’t supported in kernels or generate inefficient code.
  • The integration with CUDA.jl needs to improve to facilitate coexistence with SIMT kernels.
  • APIs may change without notice.

The project builds on Julia’s existing GPU ecosystem, integrating with CUDA.jl for array management and kernel launching. Users who are already writing GPU code in Julia with CUDA.jl will find the transition to tile-based programming straightforward.
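For context, a hedged sketch of what that host-side integration might look like, reusing the vadd kernel from earlier (this assumes a supported Blackwell GPU and that both packages are installed, so it is illustrative rather than something to run as-is):

```julia
using CUDA
import cuTile as ct

# Allocate device arrays with CUDA.jl, exactly as in SIMT-style code.
n = 2^20
a = CUDA.rand(Float32, n)
b = CUDA.rand(Float32, n)
c = CUDA.zeros(Float32, n)

# Launch the vadd kernel from earlier; the grid covers the vector in tiles.
tile_size = 1024
ct.launch(vadd, cld(n, tile_size), a, b, c, ct.Constant(tile_size))
```

The arrays are ordinary CuArrays, so the same buffers can be passed to both cuTile.jl kernels and existing CUDA.jl kernels.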

Getting started

Just like cuTile Python, cuTile.jl requires an NVIDIA Blackwell GPU and an NVIDIA driver supporting CUDA 13 or newer. The package also requires Julia 1.11 or newer.

Launch Julia, and press `]` at the REPL to enter the integrated package manager, then install cuTile.jl:

pkg> add cuTile

pkg> # if you want, run the test suite
     test cuTile

The GitHub repository contains a full list of supported operations and detailed documentation on how cuTile.jl differs from both cuTile Python and standard Julia.
