Boost your model performance with pre-optimized kernels, easily loaded from the Hub.
Today, we’ll explore an exciting development from Hugging Face: the Kernel Hub! As ML practitioners, we all know that maximizing performance often involves diving deep into optimized code, custom CUDA kernels, or complex build systems. The Kernel Hub simplifies this process dramatically!
Below is a brief example of how you can use a kernel in your code.
import torch
from kernels import get_kernel
# Download the optimized activation kernels from the Hub (cached after the first call)
activation = get_kernel("kernels-community/activation")
x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
# Run the optimized GELU kernel, writing into the pre-allocated output y
activation.gelu_fast(y, x)
print(y)
In the subsequent sections we’ll cover the following topics:
- What’s the Kernel Hub? – Understanding the core concept.
- How to Use the Kernel Hub – A quick code example.
- Adding a Kernel to a Simple Model – A practical integration using RMSNorm.
- Reviewing Performance Impact – Benchmarking the RMSNorm difference.
- Real world use cases – Examples of how the kernels library is being used in other projects.
We’ll introduce these concepts quickly – you can grasp the core idea in about 5 minutes (though experimenting and benchmarking might take a bit longer!).
1. What’s the Kernel Hub?
The Kernel Hub (👈 Check it out!) allows Python libraries and applications to load optimized compute kernels directly from the Hugging Face Hub. Think of it like the Model Hub, but for low-level, high-performance code snippets (kernels) that speed up specific operations, often on GPUs.
Examples include:
- Advanced attention mechanisms (like FlashAttention, for dramatic speedups and memory savings).
- Custom quantization kernels (enabling efficient computation with lower-precision data types like INT8 or INT4).
- Specialized kernels required for complex architectures like Mixture of Experts (MoE) layers, which involve intricate routing and computation patterns.
- Activation functions and normalization layers (like LayerNorm or RMSNorm).
Instead of manually managing complex dependencies, wrestling with compilation flags, or building libraries like Triton or CUTLASS from source, you can use the kernels library to instantly fetch and run pre-compiled, optimized kernels.
For instance, to enable FlashAttention you need just one line, with no builds and no flags:
from kernels import get_kernel
flash_attention = get_kernel("kernels-community/flash-attn")
kernels detects your exact Python, PyTorch, and CUDA versions, then downloads the matching pre-compiled binary, typically in seconds (or a minute or two on a slow connection).
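If you’re curious what the library has to match in your environment, you can print the relevant version information yourself; this is a minimal illustrative check, since kernels performs the detection internally:
import platform
import torch

# Version information a pre-built kernel binary has to match
print(f"Python : {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA   : {torch.version.cuda}")  # None on CPU-only builds
print(f"GPU    : {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none'}")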
By contrast, compiling FlashAttention yourself requires:
- Cloning the repository and installing every dependency.
- Configuring build flags and environment variables.
- Reserving ~96 GB of RAM and many CPU cores.
- Waiting 10 minutes to several hours, depending on your hardware.
(See the project’s own installation guide for details.)
The Kernel Hub erases all that friction: one function call, instant acceleration.
Advantages of the Kernel Hub:
- Easy Access to Optimized Kernels: Load and run kernels optimized for various hardware starting with NVIDIA and AMD GPUs, without local compilation hassles.
- Share and Reuse: Discover, share, and reuse kernels across different projects and the community.
- Easy Updates: Stay up-to-date with the latest kernel improvements simply by pulling the latest version from the Hub.
- Speed up Development: Focus on your model architecture and logic, not on the intricacies of kernel compilation and deployment.
- Improve Performance: Leverage kernels optimized by experts to potentially speed up training and inference.
- Simplify Deployment: Reduce the complexity of your deployment environment by fetching kernels on demand.
- Develop and Share Your Own Kernels: If you create optimized kernels, you can easily share them on the Hub for others to use. This encourages collaboration and knowledge sharing across the community.
As many machine learning developers know, managing dependencies and building low-level code from source can be a time-consuming and error-prone process. The Kernel Hub aims to simplify this by providing a centralized repository of optimized compute kernels that can be easily loaded and run.
Spend more time building great models and less time fighting build systems!
2. How to Use the Kernel Hub (Basic Example)
Using the Kernel Hub is designed to be straightforward. The kernels library provides the main interface. Here’s a quick example that loads an optimized GELU activation function kernel. (Later on, we’ll see another example of how to integrate a kernel into a model.)
File: activation_validation_example.py
import torch
import torch.nn.functional as F
from kernels import get_kernel
DEVICE = "cuda"
torch.manual_seed(42)
# Download and load the pre-compiled activation kernels from the Hub
activation_kernels = get_kernel("kernels-community/activation")

# Random input tensor on the GPU; the kernel writes its result into y
x = torch.randn((4, 4), dtype=torch.float16, device=DEVICE)
y = torch.empty_like(x)

# Run the optimized fast-GELU kernel
activation_kernels.gelu_fast(y, x)

# Verify the result against PyTorch's built-in GELU
expected = F.gelu(x)
torch.testing.assert_close(y, expected, rtol=1e-2, atol=1e-2)
print("✅ Kernel output matches PyTorch GELU!")
print("\nInput tensor:")
print(x)
print("\nFast GELU kernel output:")
print(y)
print("\nPyTorch GELU output:")
print(expected)
print("\nAvailable functions in 'kernels-community/activation':")
print(dir(activation_kernels))
(Note: If you have uv installed, you can save this script as script.py and run uv run script.py to automatically handle the dependencies.)
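For uv to resolve those dependencies on its own, the script needs a PEP 723 inline metadata block at the top, something like this sketch (the Python and package constraints are illustrative):
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "torch",
#     "numpy",
#     "kernels",
# ]
# ///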
What’s happening here?
- Import get_kernel: This function is the entry point to the Kernel Hub via the kernels library.
- get_kernel("kernels-community/activation"): This call looks up the activation kernel repository under the kernels-community organization, then downloads, caches, and loads the appropriate pre-compiled kernel binary.
- Prepare tensors: We create the input (x) and output (y) tensors on the GPU.
- activation_kernels.gelu_fast(y, x): We call the specific optimized function (gelu_fast) provided by the loaded kernel module.
- Verification: We check the output against PyTorch’s own GELU.
This simple example shows how easily you can fetch and execute highly optimized code. Now let’s look at a more practical integration using RMS Normalization.
3. Adding a Kernel to a Simple Model
Let’s integrate an optimized RMS Normalization kernel into a basic model. We’ll use the LlamaRMSNorm implementation provided in the kernels-community/triton-layer-norm repository (note: this repo contains various normalization kernels) and compare it against a baseline PyTorch implementation of RMSNorm.
First, define a simple RMSNorm module in PyTorch and a baseline model that uses it:
File: rmsnorm_baseline.py
import torch
import torch.nn as nn
DEVICE = "cuda"
DTYPE = torch.float16
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, variance_epsilon=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = variance_epsilon
        self.hidden_size = hidden_size

    def forward(self, x):
        input_dtype = x.dtype
        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (self.weight * x).to(input_dtype)


class BaselineModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, eps=1e-5):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.norm = RMSNorm(hidden_size, variance_epsilon=eps)
        self.activation = nn.GELU()
        self.linear2 = nn.Linear(hidden_size, output_size)
        with torch.no_grad():
            self.linear1.weight.fill_(1)
            self.linear1.bias.fill_(0)
            self.linear2.weight.fill_(1)
            self.linear2.bias.fill_(0)
            self.norm.weight.fill_(1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.norm(x)
        x = self.activation(x)
        x = self.linear2(x)
        return x
input_size = 128
hidden_size = 256
output_size = 10
eps_val = 1e-5
baseline_model = (
BaselineModel(input_size, hidden_size, output_size, eps=eps_val)
.to(DEVICE)
.to(DTYPE)
)
dummy_input = torch.randn(32, input_size, device=DEVICE, dtype=DTYPE)
output = baseline_model(dummy_input)
print("Baseline RMSNorm model output shape:", output.shape)
Now, let’s create a version using the LlamaRMSNorm kernel loaded via kernels.
File: rmsnorm_kernel.py
import torch
import torch.nn as nn
from kernels import get_kernel, use_kernel_forward_from_hub
from rmsnorm_baseline import BaselineModel
DEVICE = "cuda"
DTYPE = torch.float16
layer_norm_kernel_module = get_kernel("kernels-community/triton-layer-norm")
@use_kernel_forward_from_hub("LlamaRMSNorm")
class OriginalRMSNorm(nn.Module):
    def __init__(self, hidden_size, variance_epsilon=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = variance_epsilon
        self.hidden_size = hidden_size

    def forward(self, x):
        input_dtype = x.dtype
        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (self.weight * x).to(input_dtype)


class KernelModel(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size,
        output_size,
        device="cuda",
        dtype=torch.float16,
        eps=1e-5,
    ):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.norm = OriginalRMSNorm(hidden_size, variance_epsilon=eps)
        self.activation = nn.GELU()
        self.linear2 = nn.Linear(hidden_size, output_size)
        with torch.no_grad():
            self.linear1.weight.fill_(1)
            self.linear1.bias.fill_(0)
            self.linear2.weight.fill_(1)
            self.linear2.bias.fill_(0)
            self.norm.weight.fill_(1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.norm(x)
        x = self.activation(x)
        x = self.linear2(x)
        return x
input_size = 128
hidden_size = 256
output_size = 10
eps_val = 1e-5
kernel_model = (
KernelModel(
input_size, hidden_size, output_size, device=DEVICE, dtype=DTYPE, eps=eps_val
)
.to(DEVICE)
.to(DTYPE)
)
baseline_model = (
BaselineModel(input_size, hidden_size, output_size, eps=eps_val)
.to(DEVICE)
.to(DTYPE)
)
dummy_input = torch.randn(32, input_size, device=DEVICE, dtype=DTYPE)
output = baseline_model(dummy_input)
output_kernel = kernel_model(dummy_input)
print("Kernel RMSNorm model output shape:", output_kernel.shape)
try:
    torch.testing.assert_close(output, output_kernel, rtol=1e-2, atol=1e-2)
    print("\nBaseline and Kernel RMSNorm model outputs match!")
except AssertionError as e:
    print("\nBaseline and Kernel RMSNorm model outputs differ slightly:")
    print(e)
except NameError:
    print("\nSkipping output comparison as kernel model output was not generated.")
Important Notes on the KernelModel:
- Kernel layer: The OriginalRMSNorm class is decorated with use_kernel_forward_from_hub("LlamaRMSNorm"), which marks the layer so the kernels library can swap its forward pass for the optimized LlamaRMSNorm implementation from the Hub (exposed as layer_norm_kernel_module.layers.LlamaRMSNorm). This lets us use the optimized kernel without rewriting the layer’s own code.
- Accessing the function: The exact way to access a kernel’s functionality (layer_norm_kernel_module.layers.LlamaRMSNorm, a function such as layer_norm_kernel_module.rms_norm_forward, or something else) depends entirely on how the kernel author structured the repository on the Hub. You may need to inspect the loaded layer_norm_kernel_module object (e.g., using dir()) or check the kernel’s documentation on the Hub to find the correct function/method and its signature; see the short inspection sketch after this list.
- Parameters: RMSNorm defines only a weight (no bias), consistent with the baseline implementation.
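If you’re unsure what a kernel repository exposes, a quick inspection of the loaded module helps; here is a short sketch, assuming the repository provides a layers submodule as described above:
# Discover what the loaded kernel module exposes
print(dir(layer_norm_kernel_module))
print(dir(layer_norm_kernel_module.layers))

# The Triton-backed RMSNorm layer referenced by the decorator above
print(layer_norm_kernel_module.layers.LlamaRMSNorm)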
4. Benchmarking the Performance Impact
How much faster is the optimized Triton RMSNorm kernel compared to the standard PyTorch version? Let’s benchmark the forward pass to find out.
File: rmsnorm_benchmark.py
import torch
from rmsnorm_baseline import BaselineModel
from rmsnorm_kernel import KernelModel
DEVICE = "cuda"
DTYPE = torch.float16
def benchmark_model(model, input_tensor, num_runs=100, warmup_runs=10):
    model.eval()
    dtype = input_tensor.dtype
    model = model.to(input_tensor.device).to(dtype)

    # Warm-up runs so lazy initialization and caching don't skew the timing
    for _ in range(warmup_runs):
        _ = model(input_tensor)
    torch.cuda.synchronize()

    # Time num_runs forward passes using CUDA events
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_runs):
        _ = model(input_tensor)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_runs
    return avg_time_ms
input_size_bench = 4096
hidden_size_bench = 4096
output_size_bench = 10
eps_val_bench = 1e-5
baseline_model_bench = (
BaselineModel(
input_size_bench, hidden_size_bench, output_size_bench, eps=eps_val_bench
)
.to(DEVICE)
.to(DTYPE)
)
kernel_model_bench = (
KernelModel(
input_size_bench,
hidden_size_bench,
output_size_bench,
device=DEVICE,
dtype=DTYPE,
eps=eps_val_bench,
)
.to(DEVICE)
.to(DTYPE)
)
warmup_input = torch.randn(4096, input_size_bench, device=DEVICE, dtype=DTYPE)
_ = kernel_model_bench(warmup_input)
_ = baseline_model_bench(warmup_input)
batch_sizes = [
256,
512,
1024,
2048,
4096,
8192,
16384,
32768,
]
print(
f"{'Batch Size':<12} | {'Baseline Time (ms)':<18} | {'Kernel Time (ms)':<18} | {'Speedup'}"
)
print("-" * 74)
for batch_size in batch_sizes:
    torch.cuda.synchronize()
    bench_input = torch.randn(batch_size, input_size_bench, device=DEVICE, dtype=DTYPE)

    baseline_time = benchmark_model(baseline_model_bench, bench_input)
    kernel_time = benchmark_model(kernel_model_bench, bench_input)

    baseline_time = round(baseline_time, 4)
    kernel_time = round(kernel_time, 4)

    # Speedup of the kernel model relative to the baseline (values below 1x mean slower)
    speedup = f"{baseline_time / kernel_time:.2f}x" if kernel_time > 0 else "N/A"
    print(f"{batch_size:<12} | {baseline_time:<18} | {kernel_time:<18} | {speedup}")
Expected Outcome:
As with LayerNorm, a well-tuned RMSNorm implementation using Triton can deliver substantial speedups over PyTorch’s default version—especially for memory-bound workloads on compatible hardware (e.g., NVIDIA Ampere or Hopper GPUs) and with low-precision types like float16 or bfloat16.
Keep in Mind:
- Results may vary depending on your GPU, input size, and data type.
- Microbenchmarks can misrepresent real-world performance.
- Performance hinges on the quality of the kernel implementation.
- Optimized kernels may not benefit small batch sizes due to launch overhead.
Actual results will depend on your hardware and the specific kernel implementation. Here’s an example of what you might see (on an L4 GPU):
| Batch Size | Baseline Time (ms) | Kernel Time (ms) | Speedup |
|---|---|---|---|
| 256 | 0.2122 | 0.2911 | 0.72x |
| 512 | 0.4748 | 0.3312 | 1.43x |
| 1024 | 0.8946 | 0.6864 | 1.30x |
| 2048 | 2.0289 | 1.3889 | 1.46x |
| 4096 | 4.4318 | 2.2467 | 1.97x |
| 8192 | 9.2438 | 4.8497 | 1.91x |
| 16384 | 18.6992 | 9.8805 | 1.89x |
| 32768 | 37.079 | 19.9461 | 1.86x |
| 65536 | 73.588 | 39.593 | 1.86x |
5. Real World Use Cases
The kernels library is still growing but is already being used in several real-world projects, including:
- Text Generation Inference: The TGI project uses the kernels library to load optimized kernels for text generation tasks, improving performance and efficiency.
- Transformers: The Transformers library has integrated the kernels library to use drop-in optimized layers without requiring any changes to model code. This lets users easily switch between standard and optimized implementations.
Get Started and Next Steps!
You’ve seen how easy it is to fetch and use optimized kernels with the Hugging Face Kernel Hub. Ready to try it yourself?
- Install the library: pip install kernels torch numpy. Ensure you have a compatible PyTorch version and GPU driver installed.
- Browse the Hub: Explore available kernels on the Hugging Face Hub under the kernels tag or inside organizations like kernels-community. Look for kernels relevant to your operations (activations, attention, normalization like LayerNorm/RMSNorm, etc.).
- Experiment: Try replacing components in your own models. Use get_kernel("user-or-org/kernel-name"). Crucially, inspect the loaded kernel object (e.g., print(dir(loaded_kernel))) or check its Hub repository documentation to understand how to correctly call its functions/methods and what parameters (weights, biases, inputs, epsilon) it expects.
- Benchmark: Measure the performance impact on your specific hardware and workload. Remember to check for numerical correctness (torch.testing.assert_close).
- (Advanced) Contribute: If you develop optimized kernels, consider sharing them on the Hub!
Conclusion
The Hugging Face Kernel Hub provides a powerful yet simple way to access and leverage optimized compute kernels. By replacing standard PyTorch components with optimized versions for operations like RMS Normalization, you can potentially unlock significant performance improvements without the usual complexities of custom builds. Remember to check the specifics of each kernel on the Hub for proper usage. Give it a try and see how it can speed up your workflows!
