CuTe, a core component of CUTLASS 3.x, provides a unified algebra for describing data layouts and thread mappings, and abstracts complex memory access patterns into composable mathematical operations.
While CUTLASS 3.x and CuTe have empowered kernel developers to realize peak performance on Tensor Cores through intuitive abstractions, the extensive use of C++ templates has resulted in long compilation times. Moreover, the growing adoption of Python and just-in-time (JIT) compilation in both research and production generative AI workflows has driven the development of CUTLASS 4.
This post explains the key benefits of using CuTe DSL. We show that it offers an API consistent with C++, similar Tensor Core efficiency across GPU generations, and far shorter compilation times than C++.
For more information about the fundamentals of CuTe and CUTLASS 3.x, see CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial Microkernels and CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design.
CuTe DSL: The foundation of CUTLASS 4
The new CuTe DSL (in beta) in CUTLASS 4 brings the power of CuTe to Python programmers, allowing low-level GPU kernel authoring without the burden of C++ template metaprogramming.
To flatten the learning curve associated with the new DSL, CuTe DSL relies on the same fundamental concepts that underpin CuTe. Visit NVIDIA/cutlass on GitHub to see a few CuTe DSL examples, including the persistent variant of dense GEMM, grouped GEMM, and Fused Multi-Head Attention (FMHA).
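To give a flavor of this shared foundation, here is a minimal sketch of the CuTe layout algebra in Python. It assumes the cutlass.cute package that ships with CUTLASS 4; the function name is illustrative, and static values are inspected with plain Python print at trace time.

import cutlass.cute as cute

@cute.jit
def layout_demo():
    # An 8x4 column-major tile: shape (8, 4) with strides (1, 8), the same
    # Layout object that cute::Layout expresses in C++
    layout = cute.make_layout((8, 4), stride=(1, 8))
    # Static (compile-time) values can be printed with plain Python print
    print("layout:", layout)             # (8,4):(1,8)
    print("size  :", cute.size(layout))  # 32

layout_demo()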
Comparing CuTe DSL and CuTe C++
CuTe offers a consistent GPU programming model across more than a decade of NVIDIA GPU architectures through its robust layout representation and algebra. CuTe DSL retains the exact same programming model users have come to expect from CuTe C++, but with the ease of Python. With this come fast compile times, substantially improved error messages, a flatter learning curve, and near-instant integration into Python-native deep learning frameworks.
A side-by-side comparison of C++ and DSL code highlights that they share equivalent programming models and patterns. The only differences are in C++ versus Python language syntax.
TiledMMA
cute::TiledMma is a spatial microkernel that describes the tiling and permutations of any hardware MMA atom across a set of “threads” and data. Its representation enables writing canonical triple for loops for any hardware MMA, be it SIMT FP64 or the cutting-edge NVFP4 Blackwell tensor core instructions.
// Construct a tiled_mma object (the template arguments shown are representative;
// the originals were lost in formatting)
auto tiled_mma = make_tiled_mma(SM100_MMA_F16BF16_SS<half_t, half_t, float,
                                                     128, 128,
                                                     UMMA::Major::K, UMMA::Major::K>{},
                                Layout<Shape<_1,_1,_1>>{});
// Allocate "fragments" -- these are literally umma tmem and smem descriptors
Tensor tCrA = tiled_mma.make_fragment_A(sA); // (MMA,MMA_M,MMA_K,PIPE)
Tensor tCrB = tiled_mma.make_fragment_B(sB); // (MMA,MMA_M,MMA_K,PIPE)
// Allocate TMEM
Tensor tCtC = tiled_mma.make_fragment_C(tCgC);// (MMA,MMA_M,MMA_N)
for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
  static_assert(size<2>(tCrA) == size<2>(tCrB), "A and B contraction modes don't match!");
  gemm(tiled_mma, tCrA(_,_,k_block), tCrB(_,_,k_block), tCtC);
}
# Construct a tiled_mma object
atom = tcgen05.MmaF16BF16Op(
    io_dtype,
    acc_dtype,
    mma_inst_shape_mnk,  # (128, 128, 64)
    tcgen05.CtaGroup.ONE,
    tcgen05.OperandSource.SMEM,
    tcgen05.OperandMajorMode.K,
    tcgen05.OperandMajorMode.K,
)
tiled_mma = cute.make_tiled_mma(atom)

tCrA = tiled_mma.make_fragment_A(sA)    # (MMA, MMA_M, MMA_K, PIPE)
tCrB = tiled_mma.make_fragment_B(sB)    # (MMA, MMA_N, MMA_K, PIPE)
tCtC = tiled_mma.make_fragment_C(tCgC)  # (MMA, MMA_M, MMA_N)

for k_block_idx in range(cute.size(tCrA, mode=[2])):
    assert cute.size(tCrA, mode=[2]) == cute.size(tCrB, mode=[2]), "A and B contraction modes don't match!"
    cute.gemm(
        tiled_mma, tCtC, tCrA[None, None, k_block_idx], tCrB[None, None, k_block_idx], tCtC)
TiledCopy
A canonical cute::copy is a single loop issuing some data movement instruction to copy one tensor to another, using the layouts of the tensors to describe any transposes or permutations that may occur along the way. cute::TiledCopy is a type used to represent and verify the applicability of optimized transfers of data between any two tensors: for example, between different memory spaces such as global and shared memory, or within a memory space, with or without layout transformations (such as transposes), using any hardware-accelerated copy atom.
// Select the TMEM_LOAD atom (the original chose between atoms with std::conditional;
// its template arguments were lost in formatting, so a representative atom is shown)
using TMEM_LOAD = SM100_TMEM_LOAD_32dp32b32x;
// tCtC is the accumulator tensor
auto tiled_ldtm = make_tmem_copy(TMEM_LOAD{}, tCtC);
auto thr_ldtm = tiled_ldtm.get_slice(threadIdx.x);

Tensor tDtC = thr_ldtm.partition_S(tCtC);      // ((TMEM_LOAD,#TMEM_LOAD),MMA_M,MMA_N)
Tensor tDgC = thr_ldtm.partition_D(tCgC);      // ((TMEM_LOAD,#TMEM_LOAD),MMA_M,MMA_N)
Tensor tDrC = make_tensor<float>(shape(tDgC)); // ((TMEM_LOAD,#TMEM_LOAD),MMA_M,MMA_N)

// TMEM_LOAD
copy(tiled_ldtm, tDtC, tDrC);
# Construct a tensor memory to register memory (T2R) tiled_copy object
# tCtACC is the accumulator tensor, with layout (MMA, MMA_M, MMA_N)
# tCgC is the partitioned result (MMA, MMA_M, MMA_N, RestM, RestN, RestL) of the global tensor C (M, N)
copy_atom = cute.make_copy_atom(
    tcgen05.Ld32x32bOp(tcgen05.Repetition.x128, tcgen05.Pack.NONE),
    cutlass.Float32)
tiled_copy_t2r = tcgen05.make_tmem_copy(copy_atom, tCtACC)
thr_copy_t2r = tiled_copy_t2r.get_slice(tidx)

# This is the tensor memory layout (T2R_M, T2R_N, EPI_M, EPI_N)
tT2R_tAcc = thr_copy_t2r.partition_S(tCtACC)
# (T2R_M, T2R_N, EPI_M, EPI_N, RestM, RestN, RestL)
tT2R_gC = thr_copy_t2r.partition_D(tCgC)

# Construct the register fragment from the partitioned global tensor
tT2R_rAcc = cute.make_fragment(
    tT2R_gC[None, None, None, None, 0, 0, 0].shape, cutlass.Float32)
cute.copy(tiled_copy_t2r, tT2R_tAcc, tT2R_rAcc)
CuTe DSL performance across multiple GPU generations
One of the key factors that has driven the adoption of CUTLASS C++ in training and inference frameworks is its ability to deliver outstanding performance. CuTe DSL delivers nearly the same level of performance, and more optimizations are in the pipeline.
Moreover, CUTLASS 3 and the underlying CuTe have been deployed in research and production use cases across the past few generations of GPU hardware. Deployed GPUs have a long shelf life in production environments, sometimes in heterogeneous settings. To support these deployments, CuTe DSL at launch supports NVIDIA GPU generations from Ampere through Blackwell.
NVIDIA Blackwell performance
We measured the performance of three key operations, dense GEMM, grouped GEMM, and FMHA, for both CUTLASS C++ and CuTe DSL. Overall, CuTe DSL performance is comparable to CUTLASS C++.
Dense GEMM
We measured the performance of dense GEMM in two precision settings: float16 and float8 (e4m3). Both types use float32 as the accumulation precision.
Figure 1 shows the comparative benchmarking on NVIDIA DGX B200 with CuTe DSL dense GEMM and CUTLASS 3 dense GEMM from the NVIDIA/cutlass GitHub repo. The x-axis shows the tested problem sizes, and the y-axis represents Tensor Core math throughput efficiency captured with NVIDIA Nsight Compute.
For small GEMM-K problem sizes (K=512), the DSL kernel currently performs slower than the C++ kernel. This is due to synchronization overhead before entering the math computation of the kernel, which the team is actively working to optimize.


Grouped GEMM
Comparative benchmarking uses CuTe DSL grouped GEMM and CUTLASS 3 grouped GEMM from the NVIDIA/cutlass GitHub repo.


Fused Multi-Head Attention (FMHA)
Comparative benchmarking uses CuTe DSL FMHA and CUTLASS 3 FMHA from the NVIDIA/cutlass GitHub repo.


Ampere performance: Dense GEMM
Comparative benchmarking uses CuTe DSL dense GEMM (Ampere) and CUTLASS 3 dense GEMM (Ampere) from the NVIDIA/cutlass GitHub repo.


Reduction in compilation time
CuTe DSL gives kernel developers the ability to JIT-compile kernels using CuTe abstractions, overcoming the long compilation times of C++ templates.
As shown in Figure 5, the reduction in compilation time is remarkable: on average, up to two orders of magnitude. This not only enables kernel developers to quickly sweep more tile sizes and layout shapes to identify the right configuration for top performance, but it can also reduce the total autotuning time of features such as PyTorch Inductor.
GEMM on Blackwell achieves ~100x compilation speedup over C++, while flash attention on Blackwell delivers compilation speedups of 30-50x.


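To make the JIT workflow concrete, the following minimal sketch compiles a CuTe DSL kernel once and then reuses the compiled artifact for subsequent launches. It assumes the cutlass.cute package from CUTLASS 4 plus PyTorch for allocating device buffers; the elementwise-add kernel and all names are illustrative stand-ins for a real GEMM or FMHA entry point.

import torch
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

@cute.kernel
def add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    # One thread per element of the (M, N) problem
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()
    idx = bidx * bdim + tidx
    m, n = gA.shape
    mi, ni = idx // n, idx % n
    gC[mi, ni] = gA[mi, ni] + gB[mi, ni]

@cute.jit
def add(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    m, n = mA.shape
    threads = 256
    # Sizes are chosen so the launch covers every element exactly once
    add_kernel(mA, mB, mC).launch(grid=[(m * n) // threads, 1, 1],
                                  block=[threads, 1, 1])

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
b = torch.randn_like(a)
c = torch.empty_like(a)
mA, mB, mC = (from_dlpack(t) for t in (a, b, c))

# Compile once (seconds rather than the minutes a heavily templated C++ build
# can take), then launch the compiled kernel repeatedly with no recompilation
compiled_add = cute.compile(add, mA, mB, mC)
compiled_add(mA, mB, mC)

The compiled object can then be cached and invoked repeatedly from a framework hot loop without paying the compilation cost again.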
Easy DL framework integration
With support for the DLPack protocol, CuTe DSL can take tensor data from popular deep learning frameworks directly as input and convert it into a cute.Tensor without duplicating the underlying memory.
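A minimal sketch of that conversion, assuming PyTorch as the source framework (the same from_dlpack helper appears in the compilation sketch above; shapes and dtypes are illustrative):

import torch
from cutlass.cute.runtime import from_dlpack

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

# Wrap the framework tensor via DLPack; the cute.Tensor aliases the same
# device memory rather than copying it
x_cute = from_dlpack(x)

# x_cute can now be passed directly to @cute.jit entry points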
CuTe DSL's Python-native interfaces allow deep learning frameworks to embed custom kernels directly, without cumbersome glue code or deep expertise in CUDA C++. This accelerates development cycles by enabling researchers and engineers to rapidly prototype and deploy custom linear algebra kernels within their existing model pipelines.
The DSL's composable layout abstractions simplify expressing complex memory and thread mappings, which are critical for exploiting Tensor Core hardware efficiently across NVIDIA Ampere, Hopper, and Blackwell architectures.
Get started with CuTe DSL
CuTe DSL introduces a new programming interface to improve developer velocity while retaining the performance of CUTLASS C++. Check out the Quick Start Guide to learn more about building performant kernels. You can help expand the suite of examples by contributing your kernels to the CUTLASS GitHub repository.
To get started, download CUTLASS and read the CUTLASS documentation. Join the NVIDIA Developer Forum for deeper discussions.
Acknowledgments
We would like to express our gratitude to all the CUTLASS OSS contributors. Without their foundational contributions, CUTLASS 4 would not have been possible.
