NVIDIA CUDA Tile is a GPU programming model that targets portability for NVIDIA Tensor Cores, unlocking peak GPU performance. One of the nice things about CUDA Tile is that you can build your own DSL on top of it.
This post shares the work NVIDIA is doing to integrate CUDA Tile as a backend for OpenAI Triton, an open source Python DSL designed for writing DL kernels for GPUs. OpenAI Triton supports tiled computation, an approach that divides data and computational tasks into small blocks. Triton includes an MLIR-based compiler that generates PTX. This allows researchers without CUDA experience to write efficient GPU code.
What are CUDA Tile and CUDA Tile IR?
CUDA Tile extends the CUDA programming model to enable first-class support for tile programming. Introduced in CUDA 13.1, CUDA Tile represents a paradigm shift in GPU programming. Rather than requiring developers to think in terms of individual threads through the SIMT model, the tile-based model allows computation to be expressed at a higher level of abstraction.
You simply specify operations on data blocks (tiles), while the compiler and runtime system automatically handle thread scheduling, hardware mapping, and resource allocation. This design simultaneously reduces programming complexity and enables more aggressive compiler optimizations.
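To make the idea concrete, here is a conceptual sketch in plain NumPy—not CUDA Tile code—showing a computation expressed as operations on whole tiles rather than individual elements:

```python
import numpy as np

# Conceptual sketch only (plain NumPy, not CUDA Tile): express a
# computation as whole-tile operations; a tile compiler and runtime
# would decide how each tile maps to threads and hardware resources.
def tiled_add(x, y, block=4):
    out = np.empty_like(x)
    for start in range(0, len(x), block):
        tile = slice(start, start + block)  # one tile (block) of data
        out[tile] = x[tile] + y[tile]       # a single whole-tile operation
    return out

a = np.arange(8.0)
b = np.ones(8)
print(tiled_add(a, b))  # [1. 2. 3. 4. 5. 6. 7. 8.]
```

In the tile model, the per-element indexing visible in the loop body is exactly what the programmer no longer writes by hand.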
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure. CUDA Tile development is driven by the CUDA Tile IR specification, which defines the formal semantics, operations, and type system for tile-based computations on NVIDIA GPUs.
What’s Triton-to-TileIR?
The Triton-to-TileIR backend is a bridge for Triton that allows it to target CUDA Tile IR (instead of PTX). It extends the Triton compiler ecosystem, enabling developers to compile and execute GPU kernels written in OpenAI Triton through the newly introduced CUDA Tile IR backend. This bridges a high-level programming language (Triton) with the NVIDIA next-generation GPU programming model, offering a seamless path to leverage modern hardware capabilities without requiring code rewrites.
As GPU programming continues to evolve beyond traditional SIMT models toward tile-based abstractions, this integration enables developers to benefit from Triton’s accessible Python syntax while gaining access to TileIR-native support for Tensor Cores and architectural portability.
Triton-to-TileIR democratizes access to these new capabilities. Notably, Triton itself is fundamentally a tile-based programming language—developers express computations in terms of blocks (tiles) of data rather than individual threads, which is conceptually aligned with CUDA Tile IR.
This provides a direct backend compilation path: instead of lowering Triton’s tile-level abstractions down to thread-level SIMT code, Triton-to-TileIR preserves the tile-level semantics and compiles directly to CUDA Tile IR, which natively understands tile-granularity computation.
The existing Triton user community can leverage the benefits of CUDA Tile IR without learning new languages or rewriting existing code. A simple environment variable configuration switches the compilation pipeline from the PTX backend to the CUDA Tile IR backend, unlocking improved performance and future-proof architecture compatibility.
Triton users will be able to select which backend (PTX or CUDA Tile IR) to use on a per-kernel basis in their applications.
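Today the switch is process-wide: the `ENABLE_TILE` environment variable (the same one used in the verification step later in this post) selects the backend before kernels are compiled. A minimal sketch of toggling it from Python—the helper name is hypothetical:

```python
import os

# Hypothetical helper: switch the Triton compilation pipeline by setting
# ENABLE_TILE before kernels are compiled. This is per-process today;
# per-kernel selection is planned.
def select_backend(use_tile_ir: bool) -> str:
    os.environ["ENABLE_TILE"] = "1" if use_tile_ir else "0"
    return os.environ["ENABLE_TILE"]

print(select_backend(True))   # "1" -> CUDA Tile IR backend
print(select_backend(False))  # "0" -> default PTX (SIMT) backend
```

Because the variable is read at compile time, it must be set before the first kernel launch triggers compilation.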
Development road map of Triton-to-TileIR
As an incubator project within the triton-lang organization, Triton-to-TileIR is in active development. The repository serves as a collaboration space for implementing and refining the CUDA Tile IR backend before potential integration into the main Triton compiler.
Several technical workstreams likely comprise the development road map, including:
- Core conversion infrastructure: Implementing MLIR dialect conversion patterns to map Triton operations to CUDA Tile IR equivalents
- Testing and validation: Developing comprehensive test suites to verify semantic correctness of the transformation, including edge cases in control flow, memory access patterns, and numerical precision
- Performance benchmarking: Establishing performance baselines comparing TileIR-compiled kernels against PTX-compiled equivalents across diverse operations (matrix multiplication, convolutions, element-wise operations, reductions, and so on)
- Open source project integration: Coordinating with the open source community to enable better support for the CUDA Tile IR backend in open source projects, such as Helion
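To illustrate the shape of the benchmarking workstream, a comparison harness might time the same kernel callable under each backend. The `bench` helper below is a hypothetical sketch; real GPU measurements should synchronize the device and would typically use `triton.testing.do_bench` rather than wall-clock time:

```python
import time

# Hypothetical harness: average wall-clock time of a callable.
# Real GPU benchmarks should synchronize the device and prefer
# triton.testing.do_bench; this only shows the comparison shape.
def bench(fn, iters=100):
    fn()  # warm-up so one-time compilation is excluded
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Stand-in callables; in practice these would be the same Triton kernel
# compiled once with ENABLE_TILE=0 and once with ENABLE_TILE=1.
ptx_time = bench(lambda: sum(range(1000)))
tile_time = bench(lambda: sum(range(1000)))
print(f"speedup: {ptx_time / tile_time:.2f}x")
```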
How to use Triton-to-TileIR
Triton-to-TileIR currently supports only source-based compilation. Prebuilt binaries are not available, so you need to build the project from source in your local environment.
Prerequisites:
- CUDA version: CUDA 13.1 or higher
- GPU architecture: NVIDIA Blackwell GPUs (for example, GeForce RTX 5080); previous GPU architectures will be enabled in upcoming CUDA releases
Build from source
Once the prerequisites are satisfied, clone and build the project from source:
# Clone the repository
git clone https://github.com/triton-lang/Triton-to-tile-IR.git
cd Triton-to-tile-IR
# Build and install
# Follow the specific build instructions in the project's README
pip install -e .
Note that detailed build steps may vary. Consult the Triton-to-TileIR README and build documentation for architecture-specific configurations, dependency management, and troubleshooting guidance.
Verify Tile IR compilation
After building, verify the installation by running the vector addition tutorial and confirming that the Tile IR backend is being used:
# Navigate to the tutorial directory
cd python/tutorials
# Run the vector addition example with Tile IR enabled
export ENABLE_TILE=1
python 01-vector-add.py
When the Tile IR backend is active, Triton caches compiled kernels with .tileIR file extensions instead of the standard .cubin files used by the SIMT backend. Check for these cached files:
# Find the Triton cache directory (typically in ~/.triton/cache)
# and look for .tileIR artifacts
ls -R ~/.triton/cache | grep -i tileIR
Limitations of Triton-to-TileIR
While Triton-to-TileIR opens promising new possibilities, the project remains in a relatively early development stage with several known constraints, including unsupported operations and temporary performance issues.
Unsupported operations
Not all Triton-supported operations are implemented in the Tile IR backend yet. Learn more about operations and features that are not yet supported or fully supported.
As new CUDA versions are released, the compatibility of the Triton CUDA Tile IR backend will continue to improve.
Tensor-of-pointer performance degradation
The “tensor-of-pointer” pattern in Triton—where tensors are composed of pointers describing memory access patterns—shows suboptimal performance on the Tile IR backend with CUDA 13.1. This is a temporary performance issue. For impacted workloads, you can:
- Temporarily fall back to the SIMT backend for certain critical operations
- Wait for forthcoming optimization passes in future project releases
- Refine code to adopt the TMA load/store API
Regarding the last point, refining code to adopt the TMA load/store API: many of the tensors loaded in kernels have contiguous tiles and well-defined shapes and strides. As a result, materializing a tensor-of-pointers inside the kernel is no longer necessary. Instead, this layout information can be passed to the TMA load/store API, allowing the Tile IR backend to improve performance.
For example, a typical tensor-of-pointers pattern might look like the following:
# Before: tensor-of-pointer style
offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
offs_k = tl.arange(0, BLOCK_K)
a_ptrs = a_ptr + (offs_m[:, None] * stride_am
+ offs_k[None, :] * stride_ak)
b_ptrs = b_ptr + (offs_k[:, None] * stride_bk
+ offs_n[None, :] * stride_bn)
a = tl.load(a_ptrs)
b = tl.load(b_ptrs)
Here, every element in a_ptrs is an explicit pointer computed inside the kernel, even though the tile itself is contiguous and its layout can be fully described by (shape, strides, block_shape).
With TMA, the same operation can be rewritten as:
desc_a = tl.make_tensor_descriptor(
a, # base pointer
shape=(M, K),
strides=(stride_am, stride_ak),
block_shape=(BLOCK_M, BLOCK_K) # tile size
)
desc_b = tl.make_tensor_descriptor(
b, shape=(K, N),
strides=(stride_bk, stride_bn),
block_shape=(BLOCK_K, BLOCK_N)
)
offs_m = pid_m * BLOCK_M
offs_n = pid_n * BLOCK_N
a_tile = desc_a.load([offs_m, 0]) # [BLOCK_M, BLOCK_K]
b_tile = desc_b.load([0, offs_n]) # [BLOCK_K, BLOCK_N]
# desc_c is a descriptor for the output, built with
# tl.make_tensor_descriptor like desc_a and desc_b
desc_c.store([offs_m, offs_n], acc) # TMA-backed store
Learn more about Triton-to-TileIR
The Triton-to-TileIR project represents a significant step in the evolution of GPU programming, bridging the gap between developer productivity and hardware efficiency. By enabling Triton’s accessible, tile-oriented programming model to target the CUDA Tile IR virtual instruction set, the integration promises to deliver performance, portability, and future-readiness for machine learning practitioners and GPU developers.
For developers already using Triton, the TileIR backend offers a pathway to leverage next-generation GPU architectures with minimal code changes. For the broader GPU programming ecosystem, this collaboration demonstrates how strategic partnerships between language designers and hardware vendors can create compounding advantages—making advanced hardware capabilities accessible without sacrificing the high-level abstractions that enable rapid innovation.
As the project matures and moves from incubation to production readiness, it will be fascinating to watch how the integration influences both Triton’s adoption and the broader trajectory of tile-based GPU programming. The ultimate success metric will be simple: can researchers with limited GPU expertise write Triton code that executes with near-optimal performance on NVIDIA GPUs?
To learn more, check out the triton-lang/Triton-to-tile-IR GitHub repo and Performance Tuning Tips for CUDA Tile IR Backend.
