Introduction
Custom kernels are the backbone of high-performance deep learning, enabling GPU operations tailored precisely to your workload, whether that's image processing, tensor transformations, or other compute-heavy tasks. But compiling these kernels for the right architectures, wiring up all of the build flags, and integrating them cleanly into PyTorch extensions can quickly turn into a mess of CMake/Nix scripts, compiler errors, and ABI issues, which is not fun. Hugging Face's kernel-builder and kernels libraries make it easy to share these kernels with the kernels-community, with support for multiple GPU and accelerator backends, including CUDA, ROCm, Metal, and XPU. This ensures your kernels are fast, portable, and seamlessly integrated with PyTorch.
In this guide, we focus exclusively on ROCm-compatible kernels and show how to build, test, and share them using kernel-builder. You'll learn how to create kernels that run efficiently on AMD GPUs, along with best practices for reproducibility, packaging, and deployment.
This ROCm-specific walkthrough is a streamlined version of the original kernel-builder guide. If you're looking for the broader CUDA-focused version, you can find it here: A Guide to Building and Scaling Production-Ready CUDA Kernels.
Build Steps
We will use the GEMM kernel from RadeonFlow_Kernels as an example. If you want to go straight to the guide, click here.
About the kernel
This section was written by the RadeonFlow GEMM kernel authors to introduce the kernel.
Authors: ColorsWind, Zesen Liu, and Andy
The RadeonFlow GEMM kernel is a high-performance, FP8 block-wise matrix multiplication implementation optimized for the AMD Instinct MI300X GPU. GEMM (General Matrix Multiplication) is the core building block behind most deep learning workloads: given two matrices A and B, you compute their product C = A × B. Here it's implemented in FP8, a low-precision floating-point format that trades a bit of accuracy for much higher throughput and lower memory bandwidth. This kernel was developed for the AMD Developer Challenge 2025 and was awarded the 🏆 Grand Prize in June 2025, recognizing its excellence in performance and innovation on AMD hardware.
The kernel operates on quantized inputs using the e4m3fnuz floating-point format and applies per-block scaling to preserve accuracy during low-precision computation. The e4m3fnuz format is an FP8 variant with 4 exponent bits and 3 mantissa bits, designed to be efficient for neural network workloads. Because FP8 has a much smaller dynamic range than FP16/FP32, we apply per-block scaling factors (a_scale and b_scale) so that each block of values is rescaled into a numerically "comfortable" range before and after computation, which helps preserve accuracy despite the low precision. It takes the following arguments:
(a, b, a_scale, b_scale, c)
where a and b are the input matrices, a_scale and b_scale are the scaling factors for a and b respectively,
and c is the output matrix:
- a is K × M in e4m3fnuz
- b is K × N in e4m3fnuz
- a_scale is (K // 128) × M in fp32
- b_scale is (K // 128) × (N // 128) in fp32
- c is M × N in bf16
The kernel is precompiled for specific matrix shapes and assumes a transposed memory layout (as required by the competition). To support additional shapes or alternative memory layouts, you should modify the kernel launcher.
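To make these shape and scaling conventions concrete, here is a minimal reference computation in PyTorch. It is only a sketch for intuition: the tensor names are ours, and the real kernel fuses the per-block scaling into the matmul rather than materializing dequantized copies.
import torch

# Shapes taken from one of the configurations supported by the launcher.
M, N, K, BLOCK = 1024, 1536, 7168, 128

a = torch.randn(K, M).to(torch.float8_e4m3fnuz)   # K x M in e4m3fnuz
b = torch.randn(K, N).to(torch.float8_e4m3fnuz)   # K x N in e4m3fnuz
a_scale = torch.rand(K // BLOCK, M)               # (K // 128) x M in fp32
b_scale = torch.rand(K // BLOCK, N // BLOCK)      # (K // 128) x (N // 128) in fp32

# Dequantize block-wise, then multiply in fp32; c is M x N in bf16.
a_deq = a.float() * a_scale.repeat_interleave(BLOCK, dim=0)
b_deq = b.float() * b_scale.repeat_interleave(BLOCK, dim=0).repeat_interleave(BLOCK, dim=1)
c = (a_deq.T @ b_deq).to(torch.bfloat16)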
So now that we have a high-performance ROCm kernel, the natural question is: how do we integrate it into a real PyTorch workflow and share it with others? That's exactly what we'll cover next, using kernel-builder and kernels to structure, build, and publish the ROCm kernel.
This is a fairly technical guide, but you can still follow it step by step without understanding every detail, and everything will work fine. If you're curious, you can always come back later to dig deeper into the concepts.
Step 1: Project Structure
The Hugging Face Kernel Builder expects your files to be organized like this:
gemm/
├── build.toml
├── gemm
│ └── gemm_kernel.h
├── flake.nix
└── torch-ext
├── torch_binding.cpp
├── torch_binding.h
└── gemm
└── __init__.py
- build.toml: The project manifest; it's the brain of the build process.
- gemm/: Your raw HIP source code where the GPU magic happens.
- flake.nix: The key to a perfectly reproducible build environment.
- torch-ext/gemm/: The Python wrapper around the raw PyTorch operators.
Sometimes your project might depend on other files, like tests or helper scripts, and you can add them without any issues.
In our case, our project will be structured like this:
gemm/
├── build.toml
├── gemm
│ ├── gemm_kernel.h
│ ├── gemm_kernel_legacy.h
│ ├── transpose_kernel.h
│ └── gemm_launcher.hip
├── include
│ ├── clangd_workaround.h
│ ├── gpu_libs.h
│ ├── gpu_types.h
│ └── timer.h
├── src/utils
│ ├── arithmetic.h
│ └── timer.hip
├── tests/checker
│ ├── checker.cpp
│ ├── metrics.h
│ └── checker.h
├── flake.nix
└── torch-ext
├── torch_binding.cpp
├── torch_binding.h
└── gemm
└── __init__.py
If you look at the original files of the gemm kernel in the RadeonFlow Kernels, they're HIP source files with .cpp extensions. As a first step, you need to change these extensions to either .h or .hip depending on their content and usage:
- Use .h for header files containing kernel declarations, inline functions, or template code that will be included in other files
- Use .hip for implementation files containing HIP/GPU code that must be compiled separately (e.g., kernel launchers, device functions with complex implementations)
In our example, gemm_kernel.h, gemm_kernel_legacy.h, and transpose_kernel.h are header files, while gemm_launcher.hip is a HIP implementation file. This naming convention helps the kernel-builder correctly identify and compile each file type.
Step 2: Configuration Files Setup
The build.toml Manifest
This file orchestrates the entire build. It tells the kernel-builder what to compile and how everything connects.
[general]
name = "gemm"
universal = false
[torch]
src = [
"torch-ext/torch_binding.cpp",
"torch-ext/torch_binding.h",
]
[kernel.gemm]
backend = "rocm"
rocm-archs = [
"gfx942",
]
depends = ["torch"]
src = [
"include/clangd_workaround.h",
"include/gpu_libs.h",
"include/gpu_types.h",
"include/timer.h",
"gemm/gemm_kernel.h",
"gemm/gemm_kernel_legacy.h",
"gemm/gemm_launcher.hip",
"gemm/transpose_kernel.h",
"src/utils/arithmetic.h",
"src/utils/timer.hip",
"tests/checker/metrics.h",
]
include = ["include"]
general
This section contains general project configuration settings.
- name (required): The name of your project. This should match your kernel name and will be used for the Python package.
- universal (optional): the kernel is a universal kernel when set to true. A universal kernel is a pure Python package (no compiled files). Universal kernels don't use the other sections described below. A good example of a universal kernel is a Triton kernel (see the sketch after this list). Default: false
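For context, a universal kernel ships only Python code, so there is nothing to compile per GPU architecture at packaging time. A hypothetical Triton kernel (names are ours, purely illustrative) would look like this:
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)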
torch
This section describes the Torch extension configuration. It defines the Python bindings that will expose your kernel to PyTorch.
- src (required): A list of source files and headers for the PyTorch extension. In our case, this includes the C++ binding files that create the Python interface.
kernel.gemm
Specification of a kernel named "gemm". You can define multiple kernel sections in the same build.toml file if you have multiple kernels.
- backend (required): The compute backend for the kernel. We use “rocm” for AMD GPU support.
- rocm-archs (required for ROCm): A list of ROCm architectures that the kernel should be compiled for. "gfx942" targets the MI300 series GPUs.
- depends (required): A list of dependencies. We depend on "torch" to use PyTorch's tensor operations.
- include (optional): Include directories relative to the project root. This helps the compiler find header files.
The flake.nix Reproducibility File
To ensure anyone can build your kernel on any machine, we use a flake.nix file. It locks the exact version of the kernel-builder and its dependencies. (You can just copy and paste this example and change the description.)
{
description = "Flake for GEMM kernel";
inputs = {
kernel-builder.url = "github:huggingface/kernel-builder";
};
outputs =
{
self,
kernel-builder,
}:
kernel-builder.lib.genFlakeOutputs {
inherit self;
path = ./.;
};
}
Writing the Kernel
Now for the GPU code. Inside gemm/gemm_launcher.hip, we define how the GEMM kernel is launched.
Depending on the configuration, we either call the new optimized gemm/gemm_kernel or fall back to the legacy implementation (gemm/gemm_kernel_legacy).
extern "C" void run(
void *a, void *b, void *as, void *bs, void *c,
int m, int n, int k,
PerfMetrics *metrics, hipStream_t job_stream0
) {
const __FP8_TYPE *a_ptr = static_cast<const __FP8_TYPE *>(a);
const __FP8_TYPE *b_ptr = static_cast<const __FP8_TYPE *>(b);
__BF16_TYPE *c_ptr = static_cast<__BF16_TYPE *>(c);
const float *as_ptr = static_cast<const float *>(as);
const float *bs_ptr = static_cast<const float *>(bs);
KernelTimerScoped timer(timers, 2LL * m * n * k,
metrics ? &metrics->entries[0].time : nullptr,
metrics ? &metrics->entries[0].gflops : nullptr, job_stream0);
switch (pack_shape(m, n, k)) {
DISPATCH_GEMM(1024, 1536, 7168, 256, 128, 128, 4, 2, 512, 4, 16);
DISPATCH_GEMM(6144, 7168, 2304, 256, 128, 128, 4, 2, 512, 1, 16);
default: {
printf("Error: Unsupported shape M=%d, K=%d, N=%dn", m, k, n);
abort();
}
}
}
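Conceptually, the switch over pack_shape(m, n, k) is a lookup from the (M, N, K) triple to a kernel instantiation compiled with fixed tile parameters; any other shape aborts. A rough Python analogue of that dispatch logic (illustrative only, reading the first three DISPATCH_GEMM arguments as M, N, K):
# Illustrative sketch of the shape dispatch; the real dispatch happens at
# compile time through the DISPATCH_GEMM macro in gemm_launcher.hip.
SUPPORTED_SHAPES = {
    (1024, 1536, 7168),
    (6144, 7168, 2304),
}

def dispatch(m: int, n: int, k: int) -> None:
    if (m, n, k) not in SUPPORTED_SHAPES:
        raise ValueError(f"Unsupported shape M={m}, N={n}, K={k}")
    # ...launch the kernel instantiation precompiled for this exact shape...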
Registering a Native PyTorch Operator
This step is crucial. We're not just making the function available in Python; we're turning it into a native PyTorch operator. That means it becomes a first-class part of PyTorch itself, accessible through torch.ops.
The file torch-ext/torch_binding.cpp handles this registration.
#include <torch/all.h>        // torch::Tensor, TORCH_CHECK, operator registration
#include <hip/hip_runtime.h>  // hipStream_t
#include "registration.h"
#include "torch_binding.h"
extern "C" {
struct PerfMetrics;
void run(void *a, void *b, void *as, void *bs, void *c, int m, int n, int k, PerfMetrics *metrics, hipStream_t job_stream0);
}
void gemm(torch::Tensor &out, torch::Tensor const &a, torch::Tensor const &b,
torch::Tensor const &as, torch::Tensor const &bs) {
TORCH_CHECK(a.device().is_cuda(), "Input tensor a must be on GPU device");
TORCH_CHECK(b.device().is_cuda(), "Input tensor b must be on GPU device");
TORCH_CHECK(as.device().is_cuda(), "Scale tensor as must be on GPU device");
TORCH_CHECK(bs.device().is_cuda(), "Scale tensor bs must be on GPU device");
TORCH_CHECK(out.device().is_cuda(), "Output tensor out must be on GPU device");
TORCH_CHECK(a.is_contiguous(), "Input tensor a must be contiguous");
TORCH_CHECK(b.is_contiguous(), "Input tensor b must be contiguous");
TORCH_CHECK(as.is_contiguous(), "Scale tensor as must be contiguous");
TORCH_CHECK(bs.is_contiguous(), "Scale tensor bs must be contiguous");
TORCH_CHECK(out.is_contiguous(), "Output tensor out must be contiguous");
int M = a.size(0);
int K = a.size(1);
int N = b.size(1);
TORCH_CHECK(b.size(0) == K, "Matrix dimensions mismatch: a.size(1) != b.size(0)");
TORCH_CHECK(out.size(0) == M, "Output tensor dimension mismatch: out.size(0) != M");
TORCH_CHECK(out.size(1) == N, "Output tensor dimension mismatch: out.size(1) != N");
const hipStream_t stream = 0;
run(a.data_ptr(), b.data_ptr(), as.data_ptr(), bs.data_ptr(), out.data_ptr(),
M, N, K, nullptr, stream);
}
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.def("gemm(Tensor! out, Tensor a, Tensor b, Tensor a_scale, Tensor b_scale) -> ()");
ops.impl("gemm", torch::kCUDA, &gemm);
}
REGISTER_EXTENSION(TORCH_EXTENSION_NAME)
The torch_binding.h file contains function declarations. For instance, the gemm kernel has the following declaration in torch_binding.h:
#pragma once
#include <torch/torch.h>
void gemm(torch::Tensor &out, torch::Tensor const &a, torch::Tensor const &b,
torch::Tensor const &as, torch::Tensor const &bs);
Setting up the __init__.py wrapper
In torch-ext/gemm/ we need an __init__.py file to make this directory a Python package and to expose our custom operator in a user-friendly way.
from typing import Optional
import torch
from ._ops import ops
def gemm(a: torch.Tensor, b: torch.Tensor, as_: torch.Tensor, bs: torch.Tensor,
out: Optional[torch.Tensor] = None) -> torch.Tensor:
if out is None:
M, K = a.shape
K_b, N = b.shape
assert K == K_b, f"Matrix dimension mismatch: A has {K} cols, B has {K_b} rows"
out = torch.empty((M, N), dtype=torch.bfloat16, device=a.device)
ops.gemm(out, a, b, as_, bs)
return out
Step 3: Building the Kernel
The kernel builder uses Nix to build kernels. You can build or run the kernels directly if you have Nix (with flakes enabled) installed on your system.
Getting Started with Nix
To get started, run this:
nix flake update
This generates a flake.lock file that pins the kernel builder and all of its transitive dependencies. Commit both flake.nix and flake.lock to your repository to make sure that kernel builds are reproducible.
Since the kernel builder depends on many packages (e.g., every supported PyTorch version), it is recommended to enable the Hugging Face cache to avoid expensive rebuilds:
cachix use huggingface
Or run it once without installing cachix permanently:
nix run nixpkgs#cachix -- use huggingface
Building Kernels with Nix
A kernel that has a flake.nix file can be built directly with Nix:
cd Build_RadeonFlow_Kernels/gemm
nix build . -L
The compiled kernel will then be in the local result/ directory (we'll copy it into build/ in Step 4).
Development Shell for Local Development
The kernel-builder provides shells for developing kernels. In such a shell, all required dependencies are available, as well as build2cmake for generating project files:
$ nix develop
$ build2cmake generate-torch build.toml
$ cmake -B build-ext
$ cmake --build build-ext
If you want to test the kernel as a Python package, you can do so. nix develop will automatically create a virtual environment in .venv and activate it:
$ nix develop
$ build2cmake generate-torch build.toml
$ pip install --no-build-isolation -e .
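After the editable install, a quick smoke test from inside the dev shell could look like the sketch below (an assumption on our part: it expects an MI300-class GPU and uses one of the matrix shapes the launcher supports):
import torch
import gemm  # the package defined in torch-ext/gemm

M, N, K, BLOCK = 1024, 1536, 7168, 128
device = torch.device("cuda")  # ROCm devices are exposed through the "cuda" device type

a = torch.randn(M, K, device=device).to(torch.float8_e4m3fnuz)
b = torch.randn(K, N, device=device).to(torch.float8_e4m3fnuz)
a_scale = torch.ones(K // BLOCK, M, device=device, dtype=torch.float32)
b_scale = torch.ones(K // BLOCK, N // BLOCK, device=device, dtype=torch.float32)

c = gemm.gemm(a, b, a_scale, b_scale)
print(c.shape, c.dtype)  # expected: torch.Size([1024, 1536]) torch.bfloat16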
Development shells are available for every build configuration. For instance, you can get a Torch 2.7 development shell with ROCm 6.3 using:
$ rm -rf .venv
$ nix develop .
Step 4: Uploading the kernel to the Hub
Now that we built our kernel, we can test it and upload it to the Hub.
Building the Kernel for All PyTorch and ROCm Versions
One small thing we'll want to do before we share is clean up all the development artifacts that were generated during the build process, to avoid uploading unnecessary files.
build2cmake clean build.toml
To build the kernel for all supported versions of PyTorch and ROCm, the kernel-builder tool automates the process:
nix build . -L
Note:
This process may take a while, as it will build the kernel for all supported versions of PyTorch and ROCm.
The output will be in the result directory.
The last step is to move the results into the expected build directory (this is where the kernels library will look for them).
mkdir -p build
rsync -av --delete --chmod=Du+w,Fu+w result/ build/
Pushing to the Hugging Face Hub
Pushing the build artifacts to the Hub will make it easy for other developers to use your kernel.
First, create a new repo:
hf repo create gemm
Make sure you're logged in to the Hugging Face Hub using huggingface-cli login.
Now, in your project directory, connect your project to the new repository and push your code:
git init
git remote add origin https://huggingface.co//gemm
git pull origin main
git xet install
git checkout -b main
git xet track "*.so"
git add \
  build/ gemm/ include/ src/utils tests/checker \
  torch-ext/torch_binding.cpp torch-ext/torch_binding.h torch-ext/gemm \
  flake.nix flake.lock build.toml
git commit -m "feat: Created a compliant gemm kernel"
git push -u origin main
Fantastic! Your kernel is now on the Hugging Face Hub, ready for others to use and fully compliant with the kernels library.
Step 5: Let’s use it 🙂
With the kernels library, you don't "install" the kernel in the traditional sense. You load it directly from its Hub repository, which automatically registers the new operator.
import torch
from kernels import get_kernel
gemm = get_kernel("kernels-community/gemm")
M, N, K = 1024, 1536, 7168
QUANT_SIZE = 128
device = torch.device("cuda")
A_fp32 = torch.randn(M, K, device=device)
B_fp32 = torch.randn(K, N, device=device)
A_fp8 = A_fp32.to(torch.float8_e4m3fnuz)
B_fp8 = B_fp32.to(torch.float8_e4m3fnuz)
A_scale = torch.ones(K // QUANT_SIZE, M, device=device, dtype=torch.float32)
B_scale = torch.ones(K // QUANT_SIZE, N // QUANT_SIZE, device=device, dtype=torch.float32)
C = torch.zeros(M, N, device=device, dtype=torch.bfloat16)
result = gemm.gemm(A_fp8, B_fp8, A_scale, B_scale, C)
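Since every scale is 1.0 in this toy example, you can roughly sanity-check the output against a plain FP32 matmul. This is only a loose check under the row-major A (M × K) / B (K × N) convention used above; expect some error from FP8 quantization.
# Rough sanity check: with unit scales, the FP8 GEMM should be close to an FP32 matmul.
C_ref = (A_fp8.float() @ B_fp8.float()).to(torch.bfloat16)
print("max abs error vs FP32 reference:", (result.float() - C_ref.float()).abs().max().item())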
That's it! Your ROCm kernel is now ready to use from the Hugging Face Hub.
Conclusion
Building and sharing ROCm kernels with the Hugging Face ecosystem is now easier than ever. With a clean, reproducible workflow powered by Nix and seamless integration into PyTorch, developers can focus on optimizing performance rather than setup. Once built, your custom kernel can be shared on the Hugging Face Hub, making it immediately accessible to the community and usable across projects with just a few lines of code. 🚀
