Introduction
Custom kernels are the backbone of high-performance deep learning, enabling GPU operations tailored precisely to your workload, whether that's image processing, tensor transformations, or other compute-heavy tasks. But compiling these kernels for the right architectures, wiring up all of the build flags, and integrating them cleanly into PyTorch extensions can quickly turn into a mess of CMake/Nix scripts, compiler errors, and ABI issues, which is not fun. Hugging Face's kernel-builder and kernels libraries make it easy to share these kernels with the kernels-community, with support for multiple GPU and accelerator backends, including CUDA, ROCm, Metal, and XPU. This ensures your kernels are fast, portable, and seamlessly integrated with PyTorch.
In this guide, we focus exclusively on ROCm-compatible kernels and show how to build, test, and share them using kernel-builder. You'll learn how to create kernels that run efficiently on AMD GPUs, along with best practices for reproducibility, packaging, and deployment.
This ROCm-specific walkthrough is a streamlined version of the original kernel-builder guide. If you're looking for the broader CUDA-focused version, you can find it here: A Guide to Building and Scaling Production-Ready CUDA Kernels.
Build Steps
We will use the GEMM kernel from RadeonFlow_Kernels as an example. If you want to go straight to the guide, click here.
About the kernel
This section was written by the RadeonFlow GEMM kernel authors to introduce the kernel.
Authors: ColorsWind, Zesen Liu, and Andy
The RadeonFlow GEMM kernel is a high-performance, FP8 block-wise matrix multiplication implementation optimized for the AMD Instinct MI300X GPU. GEMM (General Matrix Multiplication) is the core building block behind most deep learning workloads: given two matrices A and B, you compute their product C = A × B. Here it's implemented in FP8, a low-precision floating-point format that trades a bit of accuracy for much higher throughput and lower memory bandwidth. This kernel was developed for the AMD Developer Challenge 2025 and was awarded the 🏆 Grand Prize in June 2025, recognizing its excellence in performance and innovation on AMD hardware.
The kernel operates on quantized inputs using the e4m3fnuz floating-point format and applies per-block scaling to preserve accuracy during low-precision computation. The e4m3fnuz format is an FP8 variant with 4 exponent bits and 3 mantissa bits, designed to be efficient for neural network workloads. Because FP8 has a much smaller dynamic range than FP16/FP32, we apply per-block scaling factors (a_scale and b_scale) so that each block of values is rescaled into a numerically "comfortable" range before and after computation, which helps preserve accuracy despite the low precision. It takes the following arguments:
(a, b, a_scale, b_scale, c)
where a and b are the input matrices, a_scale and b_scale are the scaling factors for a and b respectively,
and c is the output matrix:
- a is K × M in e4m3fnuz
- b is K × N in e4m3fnuz
- a_scale is (K // 128) × M in fp32
- b_scale is (K // 128) × (N // 128) in fp32
- c is M × N in bf16
The kernel is precompiled for specific matrix shapes and assumes a transposed memory layout (as required by the competition). To support additional shapes or alternative memory layouts, you should modify the kernel launcher.
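To make these shape and scaling conventions concrete, here is a minimal reference computation in PyTorch. It is only a sketch for intuition: the tensor names are ours, and the real kernel fuses the per-block scaling into the matmul rather than materializing dequantized copies.
import torch

# Shapes taken from one of the configurations supported by the launcher.
M, N, K, BLOCK = 1024, 1536, 7168, 128

a = torch.randn(K, M).to(torch.float8_e4m3fnuz)   # K x M in e4m3fnuz
b = torch.randn(K, N).to(torch.float8_e4m3fnuz)   # K x N in e4m3fnuz
a_scale = torch.rand(K // BLOCK, M)               # (K // 128) x M in fp32
b_scale = torch.rand(K // BLOCK, N // BLOCK)      # (K // 128) x (N // 128) in fp32

# Dequantize block-wise, then multiply in fp32; c is M x N in bf16.
a_deq = a.float() * a_scale.repeat_interleave(BLOCK, dim=0)
b_deq = b.float() * b_scale.repeat_interleave(BLOCK, dim=0).repeat_interleave(BLOCK, dim=1)
c = (a_deq.T @ b_deq).to(torch.bfloat16)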
So now that we have a high-performance ROCm kernel, the natural question is: how do we integrate it into a real PyTorch workflow and share it with others? That's exactly what we'll cover next, using kernel-builder and kernels to structure, build, and publish the ROCm kernel.
This is a fairly technical guide, but you can still follow it step by step without understanding every detail, and everything will work fine. If you're curious, you can always come back later to dig deeper into the concepts.
Step 1: Project Structure
The Hugging Face Kernel Builder expects your files to be organized like this:
gemm/
├── build.toml
├── gemm
│ └── gemm_kernel.h
├── flake.nix
└── torch-ext
├── torch_binding.cpp
├── torch_binding.h
└── gemm
└── __init__.py
- build.toml: The project manifest; it's the brain of the build process.
- gemm/: Your raw HIP source code where the GPU magic happens.
- flake.nix: The key to a perfectly reproducible build environment.
- torch-ext/gemm/: The Python wrapper around the raw PyTorch operators.
Sometimes your project might depend on other files, like tests or helper scripts, and you can add them without any issues.
In our case, our project will be structured like this:
gemm/
├── build.toml
├── gemm
│ ├── gemm_kernel.h
│ ├── gemm_kernel_legacy.h
│ ├── transpose_kernel.h
│ └── gemm_launcher.hip
├── include
│ ├── clangd_workaround.h
│ ├── gpu_libs.h
│ ├── gpu_types.h
│ └── timer.h
├── src/utils
│ ├── arithmetic.h
│ └── timer.hip
├── tests/checker
│ ├── checker.cpp
│ ├── metrics.h
│ └── checker.h
├── flake.nix
└── torch-ext
├── torch_binding.cpp
├── torch_binding.h
└── gemm
└── __init__.py
If you look at the original files of the gemm kernel in the RadeonFlow Kernels, they're HIP source files with .cpp extensions. As a first step, you need to change these extensions to either .h or .hip depending on their content and usage:
- Use .h for header files containing kernel declarations, inline functions, or template code that will be included in other files
- Use .hip for implementation files containing HIP/GPU code that must be compiled separately (e.g., kernel launchers, device functions with complex implementations)
In our example, gemm_kernel.h, gemm_kernel_legacy.h, and transpose_kernel.h are header files, while gemm_launcher.hip is a HIP implementation file. This naming convention helps the kernel-builder correctly identify and compile each file type.
Step 2: Configuration Files Setup
The build.toml Manifest
This file orchestrates the entire build. It tells the kernel-builder what to compile and how everything connects.
[general]
name = "gemm"
universal = false
[torch]
src = [
"torch-ext/torch_binding.cpp",
"torch-ext/torch_binding.h",
]
[kernel.gemm]
backend = "rocm"
rocm-archs = [
"gfx942",
]
depends = ["torch"]
src = [
"include/clangd_workaround.h",
"include/gpu_libs.h",
"include/gpu_types.h",
"include/timer.h",
"gemm/gemm_kernel.h",
"gemm/gemm_kernel_legacy.h",
"gemm/gemm_launcher.hip",
"gemm/transpose_kernel.h",
"src/utils/arithmetic.h",
"src/utils/timer.hip",
"tests/checker/metrics.h",
]
include = ["include"]
general
This section contains general project configuration settings.
- name (required): The name of your project. This should match your kernel name and will be used for the Python package.
- universal (optional): the kernel is a universal kernel when set to true. A universal kernel is a pure Python package (no compiled files). Universal kernels don't use the other sections described below. A good example of a universal kernel is a Triton kernel (see the sketch after this list). Default: false
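For context, a universal kernel ships only Python code, so there is nothing to compile per GPU architecture at packaging time. A hypothetical Triton kernel (names are ours, purely illustrative) would look like this:
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)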
torch
This section describes the Torch extension configuration. It defines the Python bindings that will expose your kernel to PyTorch.
- src (required): A list of source files and headers for the PyTorch extension. In our case, this includes the C++ binding files that create the Python interface.
kernel.gemm
Specification of a kernel named "gemm". You can define multiple kernel sections in the same build.toml file if you have multiple kernels.
- backend (required): The compute backend for the kernel. We use “rocm” for AMD GPU support.
- rocm-archs (required for ROCm): A list of ROCm architectures that the kernel should be compiled for. "gfx942" targets the MI300 series GPUs.
- depends (required): A list of dependencies. We depend on "torch" to use PyTorch's tensor operations.
- include (optional): Include directories relative to the project root. This helps the compiler find header files.
The flake.nix Reproducibility File
To ensure anyone can build your kernel on any machine, we use a flake.nix file. It locks the exact version of the kernel-builder and its dependencies. (You can just copy and paste this example and change the description.)
{
description = "Flake for GEMM kernel";
inputs = {
kernel-builder.url = "github:huggingface/kernel-builder";
};
outputs =
{
self,
kernel-builder,
}:
kernel-builder.lib.genFlakeOutputs {
inherit self;
path = ./.;
};
}
Writing the Kernel
Now for the GPU code. Inside gemm/gemm_launcher.hip, we define how the GEMM kernel is launched.
Depending on the configuration, we either call the new optimized gemm/gemm_kernel or fall back to the legacy implementation (gemm/gemm_kernel_legacy).
extern "C" void run(
void *a, void *b, void *as, void *bs, void *c,
int m, int n, int k,
PerfMetrics *metrics, hipStream_t job_stream0
) {
const __FP8_TYPE *a_ptr = static_cast<const __FP8_TYPE *>(a);
const __FP8_TYPE *b_ptr = static_cast<const __FP8_TYPE *>(b);
__BF16_TYPE *c_ptr = static_cast<__BF16_TYPE *>(c);
const float *as_ptr = static_cast<const float *>(as);
const float *bs_ptr = static_cast<const float *>(bs);
KernelTimerScoped timer(timers, 2LL * m * n * k,
metrics ? &metrics->entries[0].time : nullptr,
metrics ? &metrics->entries[0].gflops : nullptr, job_stream0);
switch (pack_shape(m, n, k)) {
DISPATCH_GEMM(1024, 1536, 7168, 256, 128, 128, 4, 2, 512, 4, 16);
DISPATCH_GEMM(6144, 7168, 2304, 256, 128, 128, 4, 2, 512, 1, 16);
default: {
printf("Error: Unsupported shape M=%d, K=%d, N=%dn", m, k, n);
abort();
}
}
}
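Conceptually, the switch over pack_shape(m, n, k) is a lookup from the (M, N, K) triple to a kernel instantiation compiled with fixed tile parameters; any other shape aborts. A rough Python analogue of that dispatch logic (illustrative only, reading the first three DISPATCH_GEMM arguments as M, N, K):
# Illustrative sketch of the shape dispatch; the real dispatch happens at
# compile time through the DISPATCH_GEMM macro in gemm_launcher.hip.
SUPPORTED_SHAPES = {
    (1024, 1536, 7168),
    (6144, 7168, 2304),
}

def dispatch(m: int, n: int, k: int) -> None:
    if (m, n, k) not in SUPPORTED_SHAPES:
        raise ValueError(f"Unsupported shape M={m}, N={n}, K={k}")
    # ...launch the kernel instantiation precompiled for this exact shape...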
Registering a Native PyTorch Operator
This step is crucial. We're not just making the function available in Python; we're turning it into a native PyTorch operator. That means it becomes a first-class part of PyTorch itself, accessible through torch.ops.
The file torch-ext/torch_binding.cpp handles this registration.
#include <torch/all.h>        // torch::Tensor, TORCH_CHECK, operator registration
#include <hip/hip_runtime.h>  // hipStream_t
#include "registration.h"
#include "torch_binding.h"
extern "C" {
struct PerfMetrics;
void run(void *a, void *b, void *as, void *bs, void *c, int m, int n, int k, PerfMetrics *metrics, hipStream_t job_stream0);
}
void gemm(torch::Tensor &out, torch::Tensor const &a, torch::Tensor const &b,
torch::Tensor const &as, torch::Tensor const &bs) {
TORCH_CHECK(a.device().is_cuda(), "Input tensor a must be on GPU device");
TORCH_CHECK(b.device().is_cuda(), "Input tensor b must be on GPU device");
TORCH_CHECK(as.device().is_cuda(), "Scale tensor as must be on GPU device");
TORCH_CHECK(bs.device().is_cuda(), "Scale tensor bs must be on GPU device");
TORCH_CHECK(out.device().is_cuda(), "Output tensor out must be on GPU device");
TORCH_CHECK(a.is_contiguous(), "Input tensor a must be contiguous");
TORCH_CHECK(b.is_contiguous(), "Input tensor b must be contiguous");
TORCH_CHECK(as.is_contiguous(), "Scale tensor as must be contiguous");
TORCH_CHECK(bs.is_contiguous(), "Scale tensor bs must be contiguous");
TORCH_CHECK(out.is_contiguous(), "Output tensor out must be contiguous");
int M = a.size(0);
int K = a.size(1);
int N = b.size(1);
TORCH_CHECK(b.size(0) == K, "Matrix dimensions mismatch: a.size(1) != b.size(0)");
TORCH_CHECK(out.size(0) == M, "Output tensor dimension mismatch: out.size(0) != M");
TORCH_CHECK(out.size(1) == N, "Output tensor dimension mismatch: out.size(1) != N");
const hipStream_t stream = 0;
run(a.data_ptr(), b.data_ptr(), as.data_ptr(), bs.data_ptr(), out.data_ptr(),
M, N, K, nullptr, stream);
}
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.def("gemm(Tensor! out, Tensor a, Tensor b, Tensor a_scale, Tensor b_scale) -> ()");
ops.impl("gemm", torch::kCUDA, &gemm);
}
REGISTER_EXTENSION(TORCH_EXTENSION_NAME)
The torch_binding.h file contains function declarations. For instance, the gemm kernel has the following declaration in torch_binding.h:
#pragma once
#include <torch/torch.h>
void gemm(torch::Tensor &out, torch::Tensor const &a, torch::Tensor const &b,
torch::Tensor const &as, torch::Tensor const &bs);
Setting up the __init__.py wrapper
In torch-ext/gemm/ we need an __init__.py file to make this directory a Python package and to expose our custom operator in a user-friendly way.
from typing import Optional
import torch
from ._ops import ops
def gemm(a: torch.Tensor, b: torch.Tensor, as_: torch.Tensor, bs: torch.Tensor,
out: Optional[torch.Tensor] = None) -> torch.Tensor:
if out is None:
M, K = a.shape
K_b, N = b.shape
assert K == K_b, f"Matrix dimension mismatch: A has {K} cols, B has {K_b} rows"
out = torch.empty((M, N), dtype=torch.bfloat16, device=a.device)
ops.gemm(out, a, b, as_, bs)
return out
Step 3: Building the Kernel
The kernel builder uses Nix to build kernels. You can build or run the kernels directly if you have Nix (with flakes enabled) installed on your system.
Getting Started with Nix
To get started, run this:
nix flake update
This generates a flake.lock file that pins the kernel builder and all of its transitive dependencies. Commit both flake.nix and flake.lock to your repository to make sure that kernel builds are reproducible.
Since the kernel builder depends on many packages (e.g., every supported PyTorch version), it is recommended to enable the Hugging Face cache to avoid expensive rebuilds:
cachix use huggingface
Or run it once without installing cachix permanently:
nix run nixpkgs#cachix -- use huggingface
Building Kernels with Nix
A kernel that has a flake.nix file can be built directly with Nix:
cd Build_RadeonFlow_Kernels/gemm
nix build . -L
The compiled kernel will then be in the local result/ directory (we'll copy it into build/ in Step 4).
Development Shell for Local Development
The kernel-builder provides shells for developing kernels. In such a shell, all required dependencies are available, as well as build2cmake for generating project files:
$ nix develop
$ build2cmake generate-torch build.toml
$ cmake -B build-ext
$ cmake --build build-ext
If you want to test the kernel as a Python package, you can do so. nix develop will automatically create a virtual environment in .venv and activate it:
$ nix develop
$ build2cmake generate-torch build.toml
$ pip install --no-build-isolation -e .
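After the editable install, a quick smoke test from inside the dev shell could look like the sketch below (an assumption on our part: it expects an MI300-class GPU and uses one of the matrix shapes the launcher supports):
import torch
import gemm  # the package defined in torch-ext/gemm

M, N, K, BLOCK = 1024, 1536, 7168, 128
device = torch.device("cuda")  # ROCm devices are exposed through the "cuda" device type

a = torch.randn(M, K, device=device).to(torch.float8_e4m3fnuz)
b = torch.randn(K, N, device=device).to(torch.float8_e4m3fnuz)
a_scale = torch.ones(K // BLOCK, M, device=device, dtype=torch.float32)
b_scale = torch.ones(K // BLOCK, N // BLOCK, device=device, dtype=torch.float32)

c = gemm.gemm(a, b, a_scale, b_scale)
print(c.shape, c.dtype)  # expected: torch.Size([1024, 1536]) torch.bfloat16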
Development shells are available for every build configuration. For instance, you can get a Torch 2.7 development shell with ROCm 6.3 using:
$ rm -rf .venv
$ nix develop .
Step 4: Uploading the kernel to the Hub
Now that we built our kernel, we can test it and upload it to the Hub.
Building the Kernel for All PyTorch and ROCm Versions
One small thing we'll want to do before we share is clean up all the development artifacts that were generated during the build process, to avoid uploading unnecessary files.
build2cmake clean build.toml
To build the kernel for all supported versions of PyTorch and ROCm, the kernel-builder tool automates the process:
nix build . -L
Note:
This process may take a while, as it will build the kernel for all supported versions of PyTorch and ROCm.
The output will be in the result directory.
The last step is to move the results into the expected build directory (this is where the kernels library will look for them).
mkdir -p build
rsync -av --delete --chmod=Du+w,Fu+w result/ build/
Pushing to the Hugging Face Hub
Pushing the build artifacts to the Hub will make it easy for other developers to use your kernel.
First, create a new repo:
hf repo create gemm
Make sure you're logged in to the Hugging Face Hub using huggingface-cli login.
Now, in your project directory, connect your project to the new repository and push your code:
git init
git remote add origin https://huggingface.co//gemm
git pull origin main
git xet install
git checkout -b main
git xet track "*.so"
git add \
  build/ gemm/ include/ src/utils tests/checker \
  torch-ext/torch_binding.cpp torch-ext/torch_binding.h torch-ext/gemm \
  flake.nix flake.lock build.toml
git commit -m "feat: Created a compliant gemm kernel"
git push -u origin main
Fantastic! Your kernel is now on the Hugging Face Hub, ready for others to use and fully compliant with the kernels library.
Step 5: Let’s use it 🙂
With the kernels library, you don't "install" the kernel in the traditional sense. You load it directly from its Hub repository, which automatically registers the new operator.
import torch
from kernels import get_kernel
gemm = get_kernel("kernels-community/gemm")
M, N, K = 1024, 1536, 7168
QUANT_SIZE = 128
device = torch.device("cuda")
A_fp32 = torch.randn(M, K, device=device)
B_fp32 = torch.randn(K, N, device=device)
A_fp8 = A_fp32.to(torch.float8_e4m3fnuz)
B_fp8 = B_fp32.to(torch.float8_e4m3fnuz)
A_scale = torch.ones(K // QUANT_SIZE, M, device=device, dtype=torch.float32)
B_scale = torch.ones(K // QUANT_SIZE, N // QUANT_SIZE, device=device, dtype=torch.float32)
C = torch.zeros(M, N, device=device, dtype=torch.bfloat16)
result = gemm.gemm(A_fp8, B_fp8, A_scale, B_scale, C)
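Since every scale is 1.0 in this toy example, you can roughly sanity-check the output against a plain FP32 matmul. This is only a loose check under the row-major A (M × K) / B (K × N) convention used above; expect some error from FP8 quantization.
# Rough sanity check: with unit scales, the FP8 GEMM should be close to an FP32 matmul.
C_ref = (A_fp8.float() @ B_fp8.float()).to(torch.bfloat16)
print("max abs error vs FP32 reference:", (result.float() - C_ref.float()).abs().max().item())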
That's it! Your ROCm kernel is now ready to use from the Hugging Face Hub.
Conclusion
Building and sharing ROCm kernels with the Hugging Face ecosystem is now easier than ever. With a clean, reproducible workflow powered by Nix and seamless integration into PyTorch, developers can focus on optimizing performance rather than setup. Once built, your custom kernel can be shared on the Hugging Face Hub, making it immediately accessible to the community and usable across projects with just a few lines of code. 🚀
