Custom CUDA kernels can give your models a serious performance edge, but building them for the real world can feel daunting. How do you move beyond a simple GPU function to a robust, scalable system without getting bogged down by endless build times and dependency nightmares?
We created the kernel-builder library for exactly this purpose. You can develop a custom kernel locally, then build it for multiple architectures and make it available for the world to use.
In this guide we'll show you how to build a complete, modern CUDA kernel from the ground up. Then we'll tackle the tough production and deployment challenges, drawing on real-world engineering strategies to show you how to build systems that are not just fast, but also efficient and maintainable.
What You’ll Learn
When you're done, other developers will be able to use your kernels directly from the Hub like this:
import torch
from kernels import get_kernel
optimized_kernel = get_kernel("your-username/optimized-kernel")
some_input = torch.randn((10, 10), device="cuda")
out = optimized_kernel.my_kernel_function(some_input)
print(out)
Rather watch a video? Check out the YouTube video that accompanies this guide.
Let's Get Started! 🚀
Part 1: Anatomy of a Modern CUDA Kernel
Let's build a practical kernel that converts an image from RGB to grayscale. This example uses PyTorch's modern C++ API to register our function as a first-class, native operator.
Step 1: Project Structure
A clean, predictable structure is the foundation of a good project. The Hugging Face kernel-builder expects your files to be organized like this:
img2gray/
├── build.toml
├── csrc
│   └── img2gray.cu
├── flake.nix
└── torch-ext
    ├── torch_binding.cpp
    ├── torch_binding.h
    └── img2gray
        └── __init__.py
- build.toml: The project manifest; it's the brain of the build process.
- csrc/: Your raw CUDA source code, where the GPU magic happens.
- flake.nix: The key to a perfectly reproducible build environment.
- torch-ext/img2gray/: The Python wrapper around the raw PyTorch operators.
Step 2: The build.toml Manifest
This file orchestrates the entire build. It tells kernel-builder what to compile and how everything connects.
[general]
name = "img2gray"
universal = false
[torch]
src = [
"torch-ext/torch_binding.cpp",
"torch-ext/torch_binding.h"
]
[kernel.img2gray]
backend = "cuda"
depends = ["torch"]
src = [
"csrc/img2gray.cu",
]
Step 3: The flake.nix Reproducibility File
To make sure anyone can build your kernel on any machine, we use a flake.nix file. It locks the exact version of kernel-builder and its dependencies, eliminating "it works on my machine" issues.
{
description = "Flake for img2gray kernel";
inputs = {
kernel-builder.url = "github:huggingface/kernel-builder";
};
outputs =
{
self,
kernel-builder,
}:
kernel-builder.lib.genFlakeOutputs {
path = ./.;
rev = self.shortRev or self.dirtyShortRev or self.lastModifiedDate;
};
}
Step 4: Writing the CUDA Kernel
Now for the GPU code. Inside csrc/img2gray.cu, we'll define a kernel that uses a 2D grid of threads, a natural and efficient fit for processing images.
#include <cstdint>
#include <torch/torch.h>
__global__ void img2gray_kernel(const uint8_t* input, uint8_t* output, int width, int height) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {
int idx = (y * width + x) * 3;
uint8_t r = input[idx];
uint8_t g = input[idx + 1];
uint8_t b = input[idx + 2];
uint8_t gray = static_cast<uint8_t>(0.21f * r + 0.72f * g + 0.07f * b);
output[y * width + x] = gray;
}
}
void img2gray_cuda(torch::Tensor const &input, torch::Tensor &output) {
const int width = input.size(1);
const int height = input.size(0);
const dim3 blockSize(16, 16);
const dim3 gridSize((width + blockSize.x - 1) / blockSize.x, (height + blockSize.y - 1) / blockSize.y);
img2gray_kernel<<<gridSize, blockSize>>>(
input.data_ptr<uint8_t>(),
output.data_ptr<uint8_t>(),
width,
height
);
}
Step 5: Registering a Native PyTorch Operator
This is the most important step. We're not just binding to Python; we're registering our function as a native PyTorch operator. This makes it a first-class citizen in the PyTorch ecosystem, visible under the torch.ops namespace.
The file torch-ext/torch_binding.cpp handles this registration.
#include <torch/library.h>
#include "registration.h"
#include "torch_binding.h"
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.def("img2gray(Tensor input, Tensor! output) -> ()");
ops.impl("img2gray", torch::kCUDA, &img2gray_cuda);
}
REGISTER_EXTENSION(TORCH_EXTENSION_NAME)
In simple terms, TORCH_LIBRARY_EXPAND allows us to define our operator in a way that can be easily extended or modified in the future.
Why This Matters
This approach is crucial for two important reasons:
1. Compatibility with torch.compile: By registering our kernel this way, torch.compile can "see" it. This allows PyTorch to fuse your custom operator into larger computation graphs, minimizing overhead and maximizing performance. It's the key to making your custom code work seamlessly with PyTorch's broader performance ecosystem.
2. Hardware-Specific Implementations: This approach lets you provide different backends for the same operator. You could add another TORCH_LIBRARY_IMPL(img2gray, CPU, ...) block pointing to a C++ CPU function, as sketched below. PyTorch's dispatcher would then automatically call the correct implementation (CUDA or CPU) based on the input tensor's device, making your code powerful and portable.
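As a rough illustration of the second point, here is a minimal sketch of what such a CPU fallback could look like; the function img2gray_cpu and the exact registration line are assumptions for this example, not code that ships with this project:
// A hypothetical CPU fallback: a plain C++ loop over the pixels of an HxWx3 tensor.
void img2gray_cpu(torch::Tensor const &input, torch::Tensor &output) {
    auto in = input.accessor<uint8_t, 3>();
    auto out = output.accessor<uint8_t, 2>();
    for (int64_t y = 0; y < in.size(0); ++y) {
        for (int64_t x = 0; x < in.size(1); ++x) {
            out[y][x] = static_cast<uint8_t>(
                0.21f * in[y][x][0] + 0.72f * in[y][x][1] + 0.07f * in[y][x][2]);
        }
    }
}

// Registered alongside the CUDA implementation, e.g. inside the same
// TORCH_LIBRARY_EXPAND block shown above:
//   ops.impl("img2gray", torch::kCPU, &img2gray_cpu);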
Setting up the __init__.py wrapper
In torch-ext/img2gray/ we need an __init__.py file to make this directory a Python package and to expose our custom operator in a user-friendly way.
The _ops module is auto-generated by kernel-builder from a template to provide a standard namespace for your registered C++ functions.
import torch
from ._ops import ops
def img2gray(input: torch.Tensor) -> torch.Tensor:
    height, width, channels = input.shape
    assert channels == 3, "Input image must have 3 channels (RGB)"
    output = torch.empty((height, width), device=input.device, dtype=input.dtype)
    ops.img2gray(input, output)
    return output
Step 6: Building the Kernel
Now that our kernel and its bindings are ready, it's time to build them. The kernel-builder tool simplifies this process.
You can build your kernel with a single command, nix build . -L; however, as developers, we'll want a faster, more iterative workflow. For that, we'll use the nix develop command to enter a development shell with all the needed dependencies pre-installed.
More specifically, we can select the exact CUDA and PyTorch versions we want to use. For example, to build our kernel for PyTorch 2.7 with CUDA 12.6, we can use the following command:
Drop into a Nix Shell
nix develop .#torch27-cxx11-cu126-x86_64-linux
Note that the devShell name above can be deciphered as:
nix develop .#torch27-cxx11-cu126-x86_64-linux
              │       │     │     │
              │       │     │     └── Architecture: x86_64 (Linux)
              │       │     └──────── CUDA version: 12.6
              │       └────────────── C++ ABI: cxx11
              └────────────────────── Torch version: 2.7
At this point, we'll be inside a Nix shell with all dependencies installed. We can now build the kernel for this specific architecture and test it. Later on, we'll deal with multiple architectures before distributing the final version of the kernel.
Set Up Build Artifacts
build2cmake generate-torch build.toml
This command creates a handful of files used to build the kernel: CMakeLists.txt, pyproject.toml, setup.py, and a cmake directory. The CMakeLists.txt file is the main entry point for CMake to build the kernel.
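After running this command, the project directory looks roughly like the tree below; the exact contents of the generated cmake directory may vary:
img2gray/
├── build.toml
├── CMakeLists.txt      # generated
├── pyproject.toml      # generated
├── setup.py            # generated
├── cmake/              # generated CMake helpers
├── csrc/
│   └── img2gray.cu
├── flake.nix
└── torch-ext/
    ├── torch_binding.cpp
    ├── torch_binding.h
    └── img2gray/
        └── __init__.py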
Create a Python Virtual Environment
python -m venv .venv
source .venv/bin/activate
Now you can install the kernel in editable mode.
Compile the Kernel and Install the Python Package
pip install --no-build-isolation -e .
🙌 Amazing! We now have a custom-built kernel that follows best practices for PyTorch bindings, with a fully reproducible build process.
Development Cycle
To make sure everything is working correctly, we can run a simple test to check that the kernel is registered and works as expected. If it doesn't, you can iterate by editing the source files and repeating the build, reusing the Nix environment you created.
import torch
import img2gray
from PIL import Image
import numpy as np
print(dir(img2gray))
img = Image.open("kernel-builder-logo-color.png").convert("RGB")
img = np.array(img)
img_tensor = torch.from_numpy(img).cuda()
print(img_tensor.shape)
gray_tensor = img2gray.img2gray(img_tensor).squeeze()
print(gray_tensor.shape)
gray_img = Image.fromarray(gray_tensor.cpu().numpy().astype(np.uint8), mode="L")
gray_img.save("kernel-builder-logo-gray.png")
Step 7: Sharing with the World
Now that we have a working kernel, it's time to share it with other developers and the world!
One small thing we need to do before sharing is clean up the development artifacts that were generated during the build process, to avoid uploading unnecessary files.
build2cmake clean build.toml
Building the Kernel for All PyTorch and CUDA Versions
Earlier, we built the kernel for a specific version of PyTorch and CUDA. However, to make it available to a wider audience, we need to build it for all supported versions. The kernel-builder tool can help us with that.
This is where the concept of a compliant kernel comes into play. A compliant kernel is one that can be built and run for all supported versions of PyTorch and CUDA. Generally, this requires custom configuration; however, in our case, the kernel-builder tool automates the process.
nix build . -L
This process may take a while, as it will build the kernel for all supported versions of PyTorch and CUDA. The output will be in the result directory.
The kernel-builder team actively maintains the supported build variants, keeping them current with the latest PyTorch and CUDA releases while also supporting trailing versions for broader compatibility.
The last step is to move the results into the expected build directory (this is where the kernels library will look for them).
mkdir -p build
rsync -av --delete --chmod=Du+w,Fu+w result/ build/
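After the copy, the build directory contains one subdirectory per build variant, each holding an installable copy of the Python package. The variant names and file names below are illustrative; the exact set depends on the versions kernel-builder currently supports:
build/
├── torch26-cxx11-cu126-x86_64-linux/
│   └── img2gray/
│       ├── __init__.py
│       ├── _ops.py
│       └── *.so            # the compiled extension
└── torch27-cxx11-cu126-x86_64-linux/
    └── img2gray/
        └── ...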
Pushing to the Hugging Face Hub
Pushing the build artifacts to the Hub makes it straightforward for other developers to use your kernel, as we saw in our previous post.
First, create a new repo:
hf repo create img2gray
Make sure you are logged in to the Hugging Face Hub using huggingface-cli login.
Now, in your project directory, connect your project to the new repository and push your code:
git init
git remote add origin https://huggingface.co/<your-username>/img2gray
git pull origin main
git lfs install
git checkout -b main
git lfs track "*.so"
git add \
  build/ csrc/ \
  torch-ext/torch_binding.cpp torch-ext/torch_binding.h torch-ext/img2gray \
  flake.nix flake.lock build.toml
git commit -m "feat: Created a compliant img2gray kernel"
git push -u origin main
Incredible! Your kernel is now on the Hugging Face Hub, ready for others to use and fully compliant with the kernels library. Our kernel and all of its build variants are now available at drbh/img2gray.
Step 8: Loading and Testing Your Custom Op
With the kernels library, you don't "install" the kernel in the traditional sense. You load it directly from its Hub repository, which automatically registers the new operator.
import torch
from PIL import Image
import numpy as np
from kernels import get_kernel
img2gray_lib = get_kernel("drbh/img2gray")
img = Image.open("kernel-builder-logo-color.png").convert("RGB")
img = np.array(img)
img_tensor = torch.from_numpy(img).cuda()
print(img_tensor.shape)
gray_tensor = img2gray_lib.img2gray(img_tensor).squeeze()
print(gray_tensor.shape)
gray_img = Image.fromarray(gray_tensor.cpu().numpy().astype(np.uint8), mode="L")
gray_img.save("kernel-builder-logo-gray2.png")
Part 2: From One Kernel to Many: Solving Production Challenges
Once you have a ready-to-use kernel, there are a few things you can do to make it easier to deploy. We will discuss using versioning as a tool to make API changes without breaking downstream use of kernels. After that, we'll wrap up by showing how you can make Python wheels for your kernel.
Kernel Versions
You might decide to update your kernel after a while. Maybe you have found new ways of improving performance, or perhaps you would like to extend the kernel's functionality. Some changes will require you to change the API of your kernel. For instance, a newer version might add a new mandatory argument to one of the public functions. This would be inconvenient for downstream users, because their code would break until they add the new argument.
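For illustration, such a breaking change might look as follows; the colorspace argument is invented for this example:
# Hypothetical API change between two kernel versions:
# v1.x: def img2gray(input: torch.Tensor) -> torch.Tensor
# v2.0: def img2gray(input: torch.Tensor, colorspace: str) -> torch.Tensor
#
# Existing calls like img2gray(img) would now fail until callers
# pass the new mandatory colorspace argument.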
A downstream user of a kernel can avoid such breakage by pinning the kernel they use to a particular revision. For instance, since each Hub repository is also a Git repository, they could use a Git commit shorthash to pin the kernel to a revision:
from kernels import get_kernel
img2gray_lib = get_kernel("drbh/img2gray", revision="4148918")
Using a Git shorthash reduces the chance of breakage; however, it is hard to interpret and doesn't allow graceful upgrades within a version range. We therefore recommend using the familiar semantic versioning system for Hub kernels. Adding a version to a kernel is simple: you just add a Git tag of the form vx.y.z, where x.y.z is the version. For instance, if the current version of the kernel is 1.1.2, you can tag it as v1.1.2. You can then get that version with get_kernel:
from kernels import get_kernel
img2gray_lib = get_kernel("drbh/img2gray", revision="v1.1.2")
Versioning becomes even more powerful with version bounds. In semantic versioning, versions 1.y.z must not introduce backward-incompatible changes to the public API for each succeeding y and z. So, if the kernel's version was 1.1.2 at the time of writing your code, you can require the version to be at least 1.1.2, but lower than 2.0.0:
from kernels import get_kernel
img2gray_lib = get_kernel("drbh/img2gray", version=">=1.1.2,<2")
This ensures that the code will always fetch the latest kernel from the 1.y.z series. The version bound can be any Python-style version specifier.
You can tag a version with huggingface-cli:
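For example, assuming the corresponding tags exist on the repository, other specifiers work the same way:
from kernels import get_kernel

# Any patch release in the 1.1 series (illustrative bounds):
img2gray_lib = get_kernel("drbh/img2gray", version=">=1.1,<1.2")

# At least 1.2, any later 1.x release:
img2gray_lib = get_kernel("drbh/img2gray", version=">=1.2,<2")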
$ huggingface-cli tag drbh/img2gray v1.1.2
Locking Kernels
In large projects, you may want to coordinate kernel versions globally rather than in each get_kernel call. Furthermore, it is often useful to lock kernels, so that all your users have the same kernel versions, which helps when handling bug reports.
The kernels library offers a nice way of managing kernels at the project level. To do so, add the kernels package to the build-system requirements of your project in the pyproject.toml file. After doing so, you can specify your project's kernel requirements in the tool.kernels section:
[build-system]
requires = ["kernels", "setuptools"]
build-backend = "setuptools.build_meta"
[tool.kernels.dependencies]
"drbh/img2gray" = ">=0.1.2,<0.2.0"
The version can be specified with the same kind of version specifiers as Python dependencies. This is another place where the version tags (va.b.c) come in handy: kernels uses a repository's version tags to query which versions are available. After specifying a kernel in pyproject.toml, you can lock it to a specific version using the kernels command-line utility. This utility is part of the kernels Python package:
$ kernels lock .
This generates a kernels.lock file with the latest kernel versions that are compatible with the bounds specified in pyproject.toml. kernels.lock should be committed to your project's Git repository, so that every user of the project gets the locked kernel versions. When newer kernel versions are released, you can run kernels lock again to update the lock file.
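A typical update cycle is therefore just a re-lock followed by a commit (the commit message below is illustrative):
$ kernels lock .
$ git add kernels.lock
$ git commit -m "Update locked kernel versions"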
You need one last bit to fully implement locked kernels in a project. get_locked_kernel is the counterpart to get_kernel that uses locked kernels. So, to use locked kernels, replace every occurrence of get_kernel with get_locked_kernel:
from kernels import get_locked_kernel
img2gray_lib = get_locked_kernel("drbh/img2gray")
That's it! Every call to get_locked_kernel("drbh/img2gray") in the project will now use the version specified in kernels.lock.
Pre-downloading Locked Kernels
The get_locked_kernel function will download the kernel when it is not available in the local Hub cache. This is not ideal for applications where you don't want to download binaries at runtime. For example, when you're building a Docker image for an application, you usually want the kernels to be stored in the image together with the application. This can be done in two easy steps.
First, use the load_kernel function rather than get_locked_kernel:
from kernels import load_kernel
img2gray_lib = load_kernel("drbh/img2gray")
As the name suggests, this function will only load a kernel; it will never attempt to download the kernel from the Hub. load_kernel will raise an exception if the kernel is not available locally. So, how do you make the kernels locally available? The kernels utility has you covered! Running kernels download . will download the kernels specified in kernels.lock. For example, in a Docker container you can add a step:
RUN kernels download /path/to/your/project
and the kernels will get baked into your Docker image.
Kernels use the usual Hugging Face cache, so all HF_HOME caching rules apply.
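Putting this together, a Dockerfile could look something like the sketch below; the base image, paths, and HF_HOME location are assumptions to adapt to your own project:
# Minimal sketch: bake locked kernels into the image at build time.
FROM pytorch/pytorch:2.7.0-cuda12.6-cudnn9-runtime
# Keep the Hugging Face cache in a known location inside the image.
ENV HF_HOME=/opt/hf-cache
COPY . /app
RUN pip install kernels && kernels download /app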
Creating Legacy Python Wheels
We strongly recommend downloading kernels from the Hub using the kernels package. This has many advantages:
- kernels supports loading multiple versions of the same kernel in a single Python process.
- kernels will automatically download a build of the kernel that is compatible with the CUDA and Torch versions of your environment.
- You get all the benefits of the Hub: analytics, issue tracking, pull requests, forks, etc.
- The Hub and kernel-builder provide provenance and reproducibility: a user can see a kernel's source history and rebuild it in the same build environment for verification.
That said, some projects may require deploying kernels as wheels. The kernels utility provides an easy solution: you can convert any Hub kernel into a set of wheels with a single command:
$ kernels to-wheel drbh/img2grey 1.1.2
☸ img2grey-1.1.2+torch27cu128cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu124cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu126cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu126cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu126cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu128cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu126cxx98-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch27cu126cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu126cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu118cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu124cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu118cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu118cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
Each of these wheels behaves like any other Python wheel: the kernel can be imported with a simple import img2grey.
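For example, after copying one of the generated wheels to the target machine, installation and use follow the normal Python workflow (the wheel filename is one of those listed above):
$ pip install img2grey-1.1.2+torch27cu126cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
$ python -c "import img2grey; print(dir(img2grey))"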
This guide has walked you through the complete lifecycle of a production-ready CUDA kernel. You've seen how to build a custom kernel from the ground up, register it as a native PyTorch operator, and share it with the community on the Hugging Face Hub. We also explored best practices for versioning, dependency management, and deployment, ensuring your work is both powerful and easy to maintain.
We believe that open and collaborative development is the key to innovation. Now that you have the tools and knowledge to build your own high-performance kernels, we're excited to see what you create! We warmly invite you to share your work, ask questions, and start discussions on the Kernel Hub or in our kernel-builder GitHub repository and kernels GitHub repository. Whether you're a seasoned developer or just starting out, the community is here to support you.
Let's get building! 🚀