Custom CUDA kernels can give your models a serious performance edge, but building them for the real world can feel daunting. How do you move beyond a simple GPU function to a robust, scalable system without getting bogged down by endless build times and dependency nightmares?
We created the kernel-builder library for exactly this purpose. You can develop a custom kernel locally, then build it for multiple architectures and make it available for the world to use.
In this guide we'll show you how to build a complete, modern CUDA kernel from the ground up. Then we'll tackle the tough production and deployment challenges, drawing on real-world engineering strategies to show you how to build systems that are not just fast, but also efficient and maintainable.
What You’ll Learn
When you're done, other developers will be able to use your kernels directly from the Hub like this:
import torch
from kernels import get_kernel
optimized_kernel = get_kernel("your-username/optimized-kernel")
some_input = torch.randn((10, 10), device="cuda")
out = optimized_kernel.my_kernel_function(some_input)
print(out)
Rather watch a video? Check out the YouTube video that accompanies this guide.
Let's Get Started! 🚀
Part 1: Anatomy of a Modern CUDA Kernel
Let's build a practical kernel that converts an image from RGB to grayscale. This example uses PyTorch's modern C++ API to register our function as a first-class, native operator.
Step 1: Project Structure
A clean, predictable structure is the foundation of a good project. The Hugging Face kernel-builder expects your files to be organized like this:
img2gray/
├── build.toml
├── csrc
│   └── img2gray.cu
├── flake.nix
└── torch-ext
    ├── torch_binding.cpp
    ├── torch_binding.h
    └── img2gray
        └── __init__.py
- build.toml: The project manifest; it's the brain of the build process.
- csrc/: Your raw CUDA source code, where the GPU magic happens.
- flake.nix: The key to a perfectly reproducible build environment.
- torch-ext/img2gray/: The Python wrapper around the raw PyTorch operators.
Step 2: The build.toml Manifest
This file orchestrates the entire build. It tells kernel-builder what to compile and how everything connects.
[general]
name = "img2gray"
universal = false
[torch]
src = [
"torch-ext/torch_binding.cpp",
"torch-ext/torch_binding.h"
]
[kernel.img2gray]
backend = "cuda"
depends = ["torch"]
src = [
"csrc/img2gray.cu",
]
Step 3: The flake.nix Reproducibility File
To make sure anyone can build your kernel on any machine, we use a flake.nix file. It locks the exact version of kernel-builder and its dependencies, eliminating "it works on my machine" issues.
{
description = "Flake for img2gray kernel";
inputs = {
kernel-builder.url = "github:huggingface/kernel-builder";
};
outputs =
{
self,
kernel-builder,
}:
kernel-builder.lib.genFlakeOutputs {
path = ./.;
rev = self.shortRev or self.dirtyShortRev or self.lastModifiedDate;
};
}
Step 4: Writing the CUDA Kernel
Now for the GPU code. Inside csrc/img2gray.cu, we'll define a kernel that uses a 2D grid of threads, a natural and efficient fit for processing images.
#include <cstdint>
#include <torch/torch.h>
__global__ void img2gray_kernel(const uint8_t* input, uint8_t* output, int width, int height) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {
int idx = (y * width + x) * 3;
uint8_t r = input[idx];
uint8_t g = input[idx + 1];
uint8_t b = input[idx + 2];
uint8_t gray = static_cast<uint8_t>(0.21f * r + 0.72f * g + 0.07f * b);
output[y * width + x] = gray;
}
}
void img2gray_cuda(torch::Tensor const &input, torch::Tensor &output) {
const int width = input.size(1);
const int height = input.size(0);
const dim3 blockSize(16, 16);
const dim3 gridSize((width + blockSize.x - 1) / blockSize.x, (height + blockSize.y - 1) / blockSize.y);
img2gray_kernel<<<gridSize, blockSize>>>(
input.data_ptr<uint8_t>(),
output.data_ptr<uint8_t>(),
width,
height
);
}
Step 5: Registering a Native PyTorch Operator
This is the most important step. We're not just binding to Python; we're registering our function as a native PyTorch operator. This makes it a first-class citizen in the PyTorch ecosystem, visible under the torch.ops namespace.
The file torch-ext/torch_binding.cpp handles this registration.
#include <torch/library.h>
#include "registration.h"
#include "torch_binding.h"
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.def("img2gray(Tensor input, Tensor! output) -> ()");
ops.impl("img2gray", torch::kCUDA, &img2gray_cuda);
}
REGISTER_EXTENSION(TORCH_EXTENSION_NAME)
In simple terms, TORCH_LIBRARY_EXPAND allows us to define our operator in a way that can be easily extended or modified in the future.
Why This Matters
This approach is crucial for two important reasons:
1. Compatibility with torch.compile: By registering our kernel this way, torch.compile can "see" it. This allows PyTorch to fuse your custom operator into larger computation graphs, minimizing overhead and maximizing performance. It's the key to making your custom code work seamlessly with PyTorch's broader performance ecosystem.
2. Hardware-Specific Implementations: This approach lets you provide different backends for the same operator. You could add another TORCH_LIBRARY_IMPL(img2gray, CPU, ...) block pointing to a C++ CPU function, as sketched below. PyTorch's dispatcher would then automatically call the correct implementation (CUDA or CPU) based on the input tensor's device, making your code powerful and portable.
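As a rough illustration of the second point, here is a minimal sketch of what such a CPU fallback could look like; the function img2gray_cpu and the exact registration line are assumptions for this example, not code that ships with this project:
// A hypothetical CPU fallback: a plain C++ loop over the pixels of an HxWx3 tensor.
void img2gray_cpu(torch::Tensor const &input, torch::Tensor &output) {
    auto in = input.accessor<uint8_t, 3>();
    auto out = output.accessor<uint8_t, 2>();
    for (int64_t y = 0; y < in.size(0); ++y) {
        for (int64_t x = 0; x < in.size(1); ++x) {
            out[y][x] = static_cast<uint8_t>(
                0.21f * in[y][x][0] + 0.72f * in[y][x][1] + 0.07f * in[y][x][2]);
        }
    }
}

// Registered alongside the CUDA implementation, e.g. inside the same
// TORCH_LIBRARY_EXPAND block shown above:
//   ops.impl("img2gray", torch::kCPU, &img2gray_cpu);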
Setting up the __init__.py wrapper
In torch-ext/img2gray/ we need an __init__.py file to make this directory a Python package and to expose our custom operator in a user-friendly way.
The _ops module is auto-generated by kernel-builder from a template to provide a standard namespace for your registered C++ functions.
import torch
from ._ops import ops
def img2gray(input: torch.Tensor) -> torch.Tensor:
    height, width, channels = input.shape
    assert channels == 3, "Input image must have 3 channels (RGB)"
    output = torch.empty((height, width), device=input.device, dtype=input.dtype)
    ops.img2gray(input, output)
    return output
Step 6: Building the Kernel
Now that our kernel and its bindings are ready, it's time to build them. The kernel-builder tool simplifies this process.
You can build your kernel with a single command, nix build . -L; however, as developers, we'll want a faster, more iterative workflow. For that, we'll use the nix develop command to enter a development shell with all the needed dependencies pre-installed.
More specifically, we can select the exact CUDA and PyTorch versions we want to use. For example, to build our kernel for PyTorch 2.7 with CUDA 12.6, we can use the following command:
Drop into a Nix Shell
nix develop .#torch27-cxx11-cu126-x86_64-linux
Note that the devShell name above can be deciphered as:
nix develop .#torch27-cxx11-cu126-x86_64-linux
              │       │     │     │
              │       │     │     └── Architecture: x86_64 (Linux)
              │       │     └──────── CUDA version: 12.6
              │       └────────────── C++ ABI: cxx11
              └────────────────────── Torch version: 2.7
At this point, we'll be inside a Nix shell with all dependencies installed. We can now build the kernel for this specific architecture and test it. Later on, we'll deal with multiple architectures before distributing the final version of the kernel.
Set Up Build Artifacts
build2cmake generate-torch build.toml
This command creates a handful of files used to build the kernel: CMakeLists.txt, pyproject.toml, setup.py, and a cmake directory. The CMakeLists.txt file is the main entry point for CMake to build the kernel.
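After running this command, the project directory looks roughly like the tree below; the exact contents of the generated cmake directory may vary:
img2gray/
├── build.toml
├── CMakeLists.txt      # generated
├── pyproject.toml      # generated
├── setup.py            # generated
├── cmake/              # generated CMake helpers
├── csrc/
│   └── img2gray.cu
├── flake.nix
└── torch-ext/
    ├── torch_binding.cpp
    ├── torch_binding.h
    └── img2gray/
        └── __init__.py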
Create a Python Virtual Environment
python -m venv .venv
source .venv/bin/activate
Now you can install the kernel in editable mode.
Compile the Kernel and Install the Python Package
pip install --no-build-isolation -e .
🙌 Amazing! We now have a custom-built kernel that follows best practices for PyTorch bindings, with a fully reproducible build process.
Development Cycle
To make sure everything is working correctly, we can run a simple test to check that the kernel is registered and works as expected. If it doesn't, you can iterate by editing the source files and repeating the build, reusing the Nix environment you created.
import torch
import img2gray
from PIL import Image
import numpy as np
print(dir(img2gray))
img = Image.open("kernel-builder-logo-color.png").convert("RGB")
img = np.array(img)
img_tensor = torch.from_numpy(img).cuda()
print(img_tensor.shape)
gray_tensor = img2gray.img2gray(img_tensor).squeeze()
print(gray_tensor.shape)
gray_img = Image.fromarray(gray_tensor.cpu().numpy().astype(np.uint8), mode="L")
gray_img.save("kernel-builder-logo-gray.png")
Step 7: Sharing with the World
Now that we have a working kernel, it's time to share it with other developers and the world!
One small thing we need to do before sharing is clean up the development artifacts that were generated during the build process, to avoid uploading unnecessary files.
build2cmake clean build.toml
Building the Kernel for All PyTorch and CUDA Versions
Earlier, we built the kernel for a specific version of PyTorch and CUDA. However, to make it available to a wider audience, we need to build it for all supported versions. The kernel-builder tool can help us with that.
This is where the concept of a compliant kernel comes into play. A compliant kernel is one that can be built and run for all supported versions of PyTorch and CUDA. Generally, this requires custom configuration; however, in our case, the kernel-builder tool automates the process.
nix build . -L
This process may take a while, as it will build the kernel for all supported versions of PyTorch and CUDA. The output will be in the result directory.
The kernel-builder team actively maintains the supported build variants, keeping them current with the latest PyTorch and CUDA releases while also supporting trailing versions for broader compatibility.
The last step is to move the results into the expected build directory (this is where the kernels library will look for them).
mkdir -p build
rsync -av --delete --chmod=Du+w,Fu+w result/ build/
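After the copy, the build directory contains one subdirectory per build variant, each holding an installable copy of the Python package. The variant names and file names below are illustrative; the exact set depends on the versions kernel-builder currently supports:
build/
├── torch26-cxx11-cu126-x86_64-linux/
│   └── img2gray/
│       ├── __init__.py
│       ├── _ops.py
│       └── *.so            # the compiled extension
└── torch27-cxx11-cu126-x86_64-linux/
    └── img2gray/
        └── ...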
Pushing to the Hugging Face Hub
Pushing the build artifacts to the Hub makes it straightforward for other developers to use your kernel, as we saw in our previous post.
First, create a new repo:
hf repo create img2gray
Make sure you are logged in to the Hugging Face Hub using huggingface-cli login.
Now, in your project directory, connect your project to the new repository and push your code:
git init
git remote add origin https://huggingface.co/<your-username>/img2gray
git pull origin main
git lfs install
git checkout -b main
git lfs track "*.so"
git add \
  build/ csrc/ \
  torch-ext/torch_binding.cpp torch-ext/torch_binding.h torch-ext/img2gray \
  flake.nix flake.lock build.toml
git commit -m "feat: Created a compliant img2gray kernel"
git push -u origin main
Incredible! Your kernel is now on the Hugging Face Hub, ready for others to use and fully compliant with the kernels library. Our kernel and all of its build variants are now available at drbh/img2gray.
Step 8: Loading and Testing Your Custom Op
With the kernels library, you don't "install" the kernel in the traditional sense. You load it directly from its Hub repository, which automatically registers the new operator.
import torch
from PIL import Image
import numpy as np
from kernels import get_kernel
img2gray_lib = get_kernel("drbh/img2gray")
img = Image.open("kernel-builder-logo-color.png").convert("RGB")
img = np.array(img)
img_tensor = torch.from_numpy(img).cuda()
print(img_tensor.shape)
gray_tensor = img2gray_lib.img2gray(img_tensor).squeeze()
print(gray_tensor.shape)
gray_img = Image.fromarray(gray_tensor.cpu().numpy().astype(np.uint8), mode="L")
gray_img.save("kernel-builder-logo-gray2.png")
Part 2: From One Kernel to Many: Solving Production Challenges
Once you have a ready-to-use kernel, there are a few things you can do to make it easier to deploy. We will discuss using versioning as a tool to make API changes without breaking downstream use of kernels. After that, we'll wrap up by showing how you can make Python wheels for your kernel.
Kernel Versions
You might decide to update your kernel after a while. Maybe you have found new ways of improving performance, or perhaps you would like to extend the kernel's functionality. Some changes will require you to change the API of your kernel. For instance, a newer version might add a new mandatory argument to one of the public functions. This would be inconvenient for downstream users, because their code would break until they add the new argument.
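For illustration, such a breaking change might look as follows; the colorspace argument is invented for this example:
# Hypothetical API change between two kernel versions:
# v1.x: def img2gray(input: torch.Tensor) -> torch.Tensor
# v2.0: def img2gray(input: torch.Tensor, colorspace: str) -> torch.Tensor
#
# Existing calls like img2gray(img) would now fail until callers
# pass the new mandatory colorspace argument.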
A downstream user of a kernel can avoid such breakage by pinning the kernel they use to a particular revision. For instance, since each Hub repository is also a Git repository, they could use a Git commit shorthash to pin the kernel to a revision:
from kernels import get_kernel
img2gray_lib = get_kernel("drbh/img2gray", revision="4148918")
Using a Git shorthash reduces the chance of breakage; however, it is hard to interpret and doesn't allow graceful upgrades within a version range. We therefore recommend using the familiar semantic versioning system for Hub kernels. Adding a version to a kernel is simple: you just add a Git tag of the form vx.y.z, where x.y.z is the version. For instance, if the current version of the kernel is 1.1.2, you can tag it as v1.1.2. You can then get that version with get_kernel:
from kernels import get_kernel
img2gray_lib = get_kernel("drbh/img2gray", revision="v1.1.2")
Versioning becomes even more powerful with version bounds. In semantic versioning, versions 1.y.z must not introduce backward-incompatible changes to the public API for each succeeding y and z. So, if the kernel's version was 1.1.2 at the time of writing your code, you can require the version to be at least 1.1.2, but lower than 2.0.0:
from kernels import get_kernel
img2gray_lib = get_kernel("drbh/img2gray", version=">=1.1.2,<2")
This ensures that the code will always fetch the latest kernel from the 1.y.z series. The version bound can be any Python-style version specifier.
You can tag a version with huggingface-cli:
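For example, assuming the corresponding tags exist on the repository, other specifiers work the same way:
from kernels import get_kernel

# Any patch release in the 1.1 series (illustrative bounds):
img2gray_lib = get_kernel("drbh/img2gray", version=">=1.1,<1.2")

# At least 1.2, any later 1.x release:
img2gray_lib = get_kernel("drbh/img2gray", version=">=1.2,<2")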
$ huggingface-cli tag drbh/img2gray v1.1.2
Locking Kernels
In large projects, you may want to coordinate kernel versions globally rather than in each get_kernel call. Furthermore, it is often useful to lock kernels, so that all your users have the same kernel versions, which helps when handling bug reports.
The kernels library offers a nice way of managing kernels at the project level. To do so, add the kernels package to the build-system requirements of your project in the pyproject.toml file. After doing so, you can specify your project's kernel requirements in the tool.kernels section:
[build-system]
requires = ["kernels", "setuptools"]
build-backend = "setuptools.build_meta"
[tool.kernels.dependencies]
"drbh/img2gray" = ">=0.1.2,<0.2.0"
The version can be specified with the same kind of version specifiers as Python dependencies. This is another place where the version tags (va.b.c) come in handy: kernels uses a repository's version tags to query which versions are available. After specifying a kernel in pyproject.toml, you can lock it to a specific version using the kernels command-line utility. This utility is part of the kernels Python package:
$ kernels lock .
This generates a kernels.lock file with the latest kernel versions that are compatible with the bounds specified in pyproject.toml. kernels.lock should be committed to your project's Git repository, so that every user of the project gets the locked kernel versions. When newer kernel versions are released, you can run kernels lock again to update the lock file.
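A typical update cycle is therefore just a re-lock followed by a commit (the commit message below is illustrative):
$ kernels lock .
$ git add kernels.lock
$ git commit -m "Update locked kernel versions"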
You need one last bit to fully implement locked kernels in a project. get_locked_kernel is the counterpart to get_kernel that uses locked kernels. So, to use locked kernels, replace every occurrence of get_kernel with get_locked_kernel:
from kernels import get_locked_kernel
img2gray_lib = get_locked_kernel("drbh/img2gray")
That's it! Every call to get_locked_kernel("drbh/img2gray") in the project will now use the version specified in kernels.lock.
Pre-downloading Locked Kernels
The get_locked_kernel function will download the kernel when it is not available in the local Hub cache. This is not ideal for applications where you don't want to download binaries at runtime. For example, when you're building a Docker image for an application, you usually want the kernels to be stored in the image together with the application. This can be done in two easy steps.
First, use the load_kernel function rather than get_locked_kernel:
from kernels import load_kernel
img2gray_lib = load_kernel("drbh/img2gray")
As the name suggests, this function will only load a kernel; it will never attempt to download the kernel from the Hub. load_kernel will raise an exception if the kernel is not available locally. So, how do you make the kernels locally available? The kernels utility has you covered! Running kernels download . will download the kernels specified in kernels.lock. For example, in a Docker container you can add a step:
RUN kernels download /path/to/your/project
and the kernels will get baked into your Docker image.
Kernels use the usual Hugging Face cache, so all HF_HOME caching rules apply.
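Putting this together, a Dockerfile could look something like the sketch below; the base image, paths, and HF_HOME location are assumptions to adapt to your own project:
# Minimal sketch: bake locked kernels into the image at build time.
FROM pytorch/pytorch:2.7.0-cuda12.6-cudnn9-runtime
# Keep the Hugging Face cache in a known location inside the image.
ENV HF_HOME=/opt/hf-cache
COPY . /app
RUN pip install kernels && kernels download /app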
Creating Legacy Python Wheels
We strongly recommend downloading kernels from the Hub using the kernels package. This has many advantages:
- kernels supports loading multiple versions of the same kernel in a single Python process.
- kernels will automatically download a build of the kernel that is compatible with the CUDA and Torch versions of your environment.
- You get all the benefits of the Hub: analytics, issue tracking, pull requests, forks, etc.
- The Hub and kernel-builder provide provenance and reproducibility: a user can see a kernel's source history and rebuild it in the same build environment for verification.
That said, some projects may require deploying kernels as wheels. The kernels utility provides an easy solution: you can convert any Hub kernel into a set of wheels with a single command:
$ kernels to-wheel drbh/img2grey 1.1.2
☸ img2grey-1.1.2+torch27cu128cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu124cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu126cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu126cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu126cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu128cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu126cxx98-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch27cu126cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu126cxx11-cp39-abi3-manylinux_2_28_aarch64.whl
☸ img2grey-1.1.2+torch26cu118cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu124cxx98-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch26cu118cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
☸ img2grey-1.1.2+torch27cu118cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
Each of these wheels behaves like any other Python wheel: the kernel can be imported with a simple import img2grey.
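For example, after copying one of the generated wheels to the target machine, installation and use follow the normal Python workflow (the wheel filename is one of those listed above):
$ pip install img2grey-1.1.2+torch27cu126cxx11-cp39-abi3-manylinux_2_28_x86_64.whl
$ python -c "import img2grey; print(dir(img2grey))"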
This guide has walked you through the complete lifecycle of a production-ready CUDA kernel. You've seen how to build a custom kernel from the ground up, register it as a native PyTorch operator, and share it with the community on the Hugging Face Hub. We also explored best practices for versioning, dependency management, and deployment, ensuring your work is both powerful and easy to maintain.
We believe that open and collaborative development is the key to innovation. Now that you have the tools and knowledge to build your own high-performance kernels, we're excited to see what you create! We warmly invite you to share your work, ask questions, and start discussions on the Kernel Hub or in our kernel-builder GitHub repository and kernels GitHub repository. Whether you're a seasoned developer or just starting out, the community is here to support you.
Let's get building! 🚀