Boost your model performance with pre-optimized kernels, easily loaded from the Hub.
Today, we’ll explore an exciting development from Hugging Face: the Kernel Hub! As ML practitioners, we all know that maximizing performance often involves diving deep into optimized code, custom CUDA kernels, or complex build systems. The Kernel Hub simplifies this process dramatically!
Below is a brief example of how you can use a kernel in your code.
import torch
from kernels import get_kernel
# Download the optimized activation kernels from the Hub (cached after the first call)
activation = get_kernel("kernels-community/activation")
x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
# Run the optimized GELU kernel, writing into the pre-allocated output y
activation.gelu_fast(y, x)
print(y)
In the subsequent sections we’ll cover the following topics:
- What’s the Kernel Hub? – Understanding the core concept.
- How to Use the Kernel Hub – A quick code example.
- Adding a Kernel to a Simple Model – A practical integration using RMSNorm.
- Reviewing Performance Impact – Benchmarking the RMSNorm difference.
- Real world use cases – Examples of how the kernels library is being used in other projects.
We’ll introduce these concepts quickly – you can grasp the core idea in about 5 minutes (though experimenting and benchmarking might take a bit longer!).
1. What’s the Kernel Hub?
The Kernel Hub (👈 Check it out!) allows Python libraries and applications to load optimized compute kernels directly from the Hugging Face Hub. Think of it like the Model Hub, but for low-level, high-performance code snippets (kernels) that speed up specific operations, often on GPUs.
Examples include:
- Advanced attention mechanisms (like FlashAttention, for dramatic speedups and memory savings).
- Custom quantization kernels (enabling efficient computation with lower-precision data types like INT8 or INT4).
- Specialized kernels required for complex architectures like Mixture of Experts (MoE) layers, which involve intricate routing and computation patterns.
- Activation functions and normalization layers (like LayerNorm or RMSNorm).
Instead of manually managing complex dependencies, wrestling with compilation flags, or building libraries like Triton or CUTLASS from source, you can use the kernels library to instantly fetch and run pre-compiled, optimized kernels.
For instance, to enable FlashAttention you need just one line, with no builds and no flags:
from kernels import get_kernel
flash_attention = get_kernel("kernels-community/flash-attn")
kernels detects your exact Python, PyTorch, and CUDA versions, then downloads the matching pre-compiled binary, typically in seconds (or a minute or two on a slow connection).
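If you’re curious what the library has to match in your environment, you can print the relevant version information yourself; this is a minimal illustrative check, since kernels performs the detection internally:
import platform
import torch

# Version information a pre-built kernel binary has to match
print(f"Python : {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA   : {torch.version.cuda}")  # None on CPU-only builds
print(f"GPU    : {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none'}")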
By contrast, compiling FlashAttention yourself requires:
- Cloning the repository and installing every dependency.
- Configuring build flags and environment variables.
- Reserving ~96 GB of RAM and many CPU cores.
- Waiting 10 minutes to several hours, depending on your hardware.
(See the project’s own installation guide for details.)
The Kernel Hub erases all that friction: one function call, instant acceleration.
Advantages of the Kernel Hub:
- Easy Access to Optimized Kernels: Load and run kernels optimized for various hardware starting with NVIDIA and AMD GPUs, without local compilation hassles.
- Share and Reuse: Discover, share, and reuse kernels across different projects and the community.
- Easy Updates: Stay up-to-date with the latest kernel improvements simply by pulling the latest version from the Hub.
- Speed up Development: Focus on your model architecture and logic, not on the intricacies of kernel compilation and deployment.
- Improve Performance: Leverage kernels optimized by experts to potentially speed up training and inference.
- Simplify Deployment: Reduce the complexity of your deployment environment by fetching kernels on demand.
- Develop and Share Your Own Kernels: If you create optimized kernels, you can easily share them on the Hub for others to use. This encourages collaboration and knowledge sharing across the community.
As many machine learning developers know, managing dependencies and building low-level code from source can be a time-consuming and error-prone process. The Kernel Hub aims to simplify this by providing a centralized repository of optimized compute kernels that can be easily loaded and run.
Spend more time building great models and less time fighting build systems!
2. How to Use the Kernel Hub (Basic Example)
Using the Kernel Hub is designed to be straightforward. The kernels library provides the main interface. Here’s a quick example that loads an optimized GELU activation function kernel. (Later on, we’ll see another example of how to integrate a kernel into a model.)
File: activation_validation_example.py
import torch
import torch.nn.functional as F
from kernels import get_kernel
DEVICE = "cuda"
torch.manual_seed(42)
# Download and load the pre-compiled activation kernels from the Hub
activation_kernels = get_kernel("kernels-community/activation")

# Random input tensor on the GPU; the kernel writes its result into y
x = torch.randn((4, 4), dtype=torch.float16, device=DEVICE)
y = torch.empty_like(x)

# Run the optimized fast-GELU kernel
activation_kernels.gelu_fast(y, x)

# Verify the result against PyTorch's built-in GELU
expected = F.gelu(x)
torch.testing.assert_close(y, expected, rtol=1e-2, atol=1e-2)
print("✅ Kernel output matches PyTorch GELU!")
print("\nInput tensor:")
print(x)
print("\nFast GELU kernel output:")
print(y)
print("\nPyTorch GELU output:")
print(expected)
print("\nAvailable functions in 'kernels-community/activation':")
print(dir(activation_kernels))
(Note: If you have uv installed, you can save this script as script.py and run uv run script.py to automatically handle the dependencies.)
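For uv to resolve those dependencies on its own, the script needs a PEP 723 inline metadata block at the top, something like this sketch (the Python and package constraints are illustrative):
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "torch",
#     "numpy",
#     "kernels",
# ]
# ///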
What’s happening here?
- Import get_kernel: This function is the entry point to the Kernel Hub via the kernels library.
- get_kernel("kernels-community/activation"): This call looks up the activation kernel repository under the kernels-community organization, then downloads, caches, and loads the appropriate pre-compiled kernel binary.
- Prepare tensors: We create the input (x) and output (y) tensors on the GPU.
- activation_kernels.gelu_fast(y, x): We call the specific optimized function (gelu_fast) provided by the loaded kernel module.
- Verification: We check the output against PyTorch’s own GELU.
This simple example shows how easily you can fetch and execute highly optimized code. Now let’s look at a more practical integration using RMS Normalization.
3. Adding a Kernel to a Simple Model
Let’s integrate an optimized RMS Normalization kernel into a basic model. We’ll use the LlamaRMSNorm implementation provided in the kernels-community/triton-layer-norm repository (note: this repo contains various normalization kernels) and compare it against a baseline PyTorch implementation of RMSNorm.
First, define a simple RMSNorm module in PyTorch and a baseline model that uses it:
File: rmsnorm_baseline.py
import torch
import torch.nn as nn
DEVICE = "cuda"
DTYPE = torch.float16
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, variance_epsilon=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = variance_epsilon
        self.hidden_size = hidden_size

    def forward(self, x):
        input_dtype = x.dtype
        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (self.weight * x).to(input_dtype)


class BaselineModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, eps=1e-5):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.norm = RMSNorm(hidden_size, variance_epsilon=eps)
        self.activation = nn.GELU()
        self.linear2 = nn.Linear(hidden_size, output_size)
        with torch.no_grad():
            self.linear1.weight.fill_(1)
            self.linear1.bias.fill_(0)
            self.linear2.weight.fill_(1)
            self.linear2.bias.fill_(0)
            self.norm.weight.fill_(1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.norm(x)
        x = self.activation(x)
        x = self.linear2(x)
        return x
input_size = 128
hidden_size = 256
output_size = 10
eps_val = 1e-5
baseline_model = (
BaselineModel(input_size, hidden_size, output_size, eps=eps_val)
.to(DEVICE)
.to(DTYPE)
)
dummy_input = torch.randn(32, input_size, device=DEVICE, dtype=DTYPE)
output = baseline_model(dummy_input)
print("Baseline RMSNorm model output shape:", output.shape)
Now, let’s create a version using the LlamaRMSNorm kernel loaded via kernels.
File: rmsnorm_kernel.py
import torch
import torch.nn as nn
from kernels import get_kernel, use_kernel_forward_from_hub
from rmsnorm_baseline import BaselineModel
DEVICE = "cuda"
DTYPE = torch.float16
layer_norm_kernel_module = get_kernel("kernels-community/triton-layer-norm")
@use_kernel_forward_from_hub("LlamaRMSNorm")
class OriginalRMSNorm(nn.Module):
    def __init__(self, hidden_size, variance_epsilon=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = variance_epsilon
        self.hidden_size = hidden_size

    def forward(self, x):
        input_dtype = x.dtype
        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (self.weight * x).to(input_dtype)


class KernelModel(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size,
        output_size,
        device="cuda",
        dtype=torch.float16,
        eps=1e-5,
    ):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.norm = OriginalRMSNorm(hidden_size, variance_epsilon=eps)
        self.activation = nn.GELU()
        self.linear2 = nn.Linear(hidden_size, output_size)
        with torch.no_grad():
            self.linear1.weight.fill_(1)
            self.linear1.bias.fill_(0)
            self.linear2.weight.fill_(1)
            self.linear2.bias.fill_(0)
            self.norm.weight.fill_(1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.norm(x)
        x = self.activation(x)
        x = self.linear2(x)
        return x
input_size = 128
hidden_size = 256
output_size = 10
eps_val = 1e-5
kernel_model = (
KernelModel(
input_size, hidden_size, output_size, device=DEVICE, dtype=DTYPE, eps=eps_val
)
.to(DEVICE)
.to(DTYPE)
)
baseline_model = (
BaselineModel(input_size, hidden_size, output_size, eps=eps_val)
.to(DEVICE)
.to(DTYPE)
)
dummy_input = torch.randn(32, input_size, device=DEVICE, dtype=DTYPE)
output = baseline_model(dummy_input)
output_kernel = kernel_model(dummy_input)
print("Kernel RMSNorm model output shape:", output_kernel.shape)
try:
    torch.testing.assert_close(output, output_kernel, rtol=1e-2, atol=1e-2)
    print("\nBaseline and Kernel RMSNorm model outputs match!")
except AssertionError as e:
    print("\nBaseline and Kernel RMSNorm model outputs differ slightly:")
    print(e)
except NameError:
    print("\nSkipping output comparison as kernel model output was not generated.")
Important Notes on the KernelModel:
- Kernel layer: The OriginalRMSNorm class is decorated with use_kernel_forward_from_hub("LlamaRMSNorm"), which marks the layer so the kernels library can swap its forward pass for the optimized LlamaRMSNorm implementation from the Hub (exposed as layer_norm_kernel_module.layers.LlamaRMSNorm). This lets us use the optimized kernel without rewriting the layer’s own code.
- Accessing the function: The exact way to access a kernel’s functionality (layer_norm_kernel_module.layers.LlamaRMSNorm, a function such as layer_norm_kernel_module.rms_norm_forward, or something else) depends entirely on how the kernel author structured the repository on the Hub. You may need to inspect the loaded layer_norm_kernel_module object (e.g., using dir()) or check the kernel’s documentation on the Hub to find the correct function/method and its signature; see the short inspection sketch after this list.
- Parameters: RMSNorm defines only a weight (no bias), consistent with the baseline implementation.
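If you’re unsure what a kernel repository exposes, a quick inspection of the loaded module helps; here is a short sketch, assuming the repository provides a layers submodule as described above:
# Discover what the loaded kernel module exposes
print(dir(layer_norm_kernel_module))
print(dir(layer_norm_kernel_module.layers))

# The Triton-backed RMSNorm layer referenced by the decorator above
print(layer_norm_kernel_module.layers.LlamaRMSNorm)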
4. Benchmarking the Performance Impact
How much faster is the optimized Triton RMSNorm kernel compared to the standard PyTorch version? Let’s benchmark the forward pass to find out.
File: rmsnorm_benchmark.py
import torch
from rmsnorm_baseline import BaselineModel
from rmsnorm_kernel import KernelModel
DEVICE = "cuda"
DTYPE = torch.float16
def benchmark_model(model, input_tensor, num_runs=100, warmup_runs=10):
    model.eval()
    dtype = input_tensor.dtype
    model = model.to(input_tensor.device).to(dtype)

    # Warm-up runs so lazy initialization and caching don't skew the timing
    for _ in range(warmup_runs):
        _ = model(input_tensor)
    torch.cuda.synchronize()

    # Time num_runs forward passes using CUDA events
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_runs):
        _ = model(input_tensor)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_runs
    return avg_time_ms
input_size_bench = 4096
hidden_size_bench = 4096
output_size_bench = 10
eps_val_bench = 1e-5
baseline_model_bench = (
BaselineModel(
input_size_bench, hidden_size_bench, output_size_bench, eps=eps_val_bench
)
.to(DEVICE)
.to(DTYPE)
)
kernel_model_bench = (
KernelModel(
input_size_bench,
hidden_size_bench,
output_size_bench,
device=DEVICE,
dtype=DTYPE,
eps=eps_val_bench,
)
.to(DEVICE)
.to(DTYPE)
)
warmup_input = torch.randn(4096, input_size_bench, device=DEVICE, dtype=DTYPE)
_ = kernel_model_bench(warmup_input)
_ = baseline_model_bench(warmup_input)
batch_sizes = [
256,
512,
1024,
2048,
4096,
8192,
16384,
32768,
]
print(
f"{'Batch Size':<12} | {'Baseline Time (ms)':<18} | {'Kernel Time (ms)':<18} | {'Speedup'}"
)
print("-" * 74)
for batch_size in batch_sizes:
    torch.cuda.synchronize()
    bench_input = torch.randn(batch_size, input_size_bench, device=DEVICE, dtype=DTYPE)

    baseline_time = benchmark_model(baseline_model_bench, bench_input)
    kernel_time = benchmark_model(kernel_model_bench, bench_input)

    baseline_time = round(baseline_time, 4)
    kernel_time = round(kernel_time, 4)

    # Speedup of the kernel model relative to the baseline (values below 1x mean slower)
    speedup = f"{baseline_time / kernel_time:.2f}x" if kernel_time > 0 else "N/A"
    print(f"{batch_size:<12} | {baseline_time:<18} | {kernel_time:<18} | {speedup}")
Expected Outcome:
As with LayerNorm, a well-tuned RMSNorm implementation using Triton can deliver substantial speedups over PyTorch’s default version—especially for memory-bound workloads on compatible hardware (e.g., NVIDIA Ampere or Hopper GPUs) and with low-precision types like float16 or bfloat16.
Keep in Mind:
- Results may vary depending on your GPU, input size, and data type.
- Microbenchmarks can misrepresent real-world performance.
- Performance hinges on the quality of the kernel implementation.
- Optimized kernels may not benefit small batch sizes due to launch overhead.
Actual results will depend on your hardware and the specific kernel implementation. Here’s an example of what you might see (on an L4 GPU):
| Batch Size | Baseline Time (ms) | Kernel Time (ms) | Speedup |
|---|---|---|---|
| 256 | 0.2122 | 0.2911 | 0.72x |
| 512 | 0.4748 | 0.3312 | 1.43x |
| 1024 | 0.8946 | 0.6864 | 1.30x |
| 2048 | 2.0289 | 1.3889 | 1.46x |
| 4096 | 4.4318 | 2.2467 | 1.97x |
| 8192 | 9.2438 | 4.8497 | 1.91x |
| 16384 | 18.6992 | 9.8805 | 1.89x |
| 32768 | 37.079 | 19.9461 | 1.86x |
| 65536 | 73.588 | 39.593 | 1.86x |
5. Real World Use Cases
The kernels library is still growing but is already being used in several real-world projects, including:
- Text Generation Inference: The TGI project uses the kernels library to load optimized kernels for text generation tasks, improving performance and efficiency.
- Transformers: The Transformers library has integrated the kernels library to use drop-in optimized layers without requiring any changes to model code. This lets users easily switch between standard and optimized implementations.
Get Started and Next Steps!
You’ve seen how easy it is to fetch and use optimized kernels with the Hugging Face Kernel Hub. Ready to try it yourself?
- Install the library: pip install kernels torch numpy. Ensure you have a compatible PyTorch version and GPU driver installed.
- Browse the Hub: Explore available kernels on the Hugging Face Hub under the kernels tag or inside organizations like kernels-community. Look for kernels relevant to your operations (activations, attention, normalization like LayerNorm/RMSNorm, etc.).
- Experiment: Try replacing components in your own models. Use get_kernel("user-or-org/kernel-name"). Crucially, inspect the loaded kernel object (e.g., print(dir(loaded_kernel))) or check its Hub repository documentation to understand how to correctly call its functions/methods and what parameters (weights, biases, inputs, epsilon) it expects.
- Benchmark: Measure the performance impact on your specific hardware and workload. Remember to check for numerical correctness (torch.testing.assert_close).
- (Advanced) Contribute: If you develop optimized kernels, consider sharing them on the Hub!
Conclusion
The Hugging Face Kernel Hub provides a powerful yet simple way to access and leverage optimized compute kernels. By replacing standard PyTorch components with optimized versions for operations like RMS Normalization, you can potentially unlock significant performance improvements without the usual complexities of custom builds. Remember to check the specifics of each kernel on the Hub for proper usage. Give it a try and see how it can speed up your workflows!
