Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers




OpenAI recently released their GPT-OSS series of models. The models feature some novel techniques like MXFP4 quantization, efficient kernels, a brand new chat format, and more. To enable the release of gpt-oss through transformers, we have upgraded the library considerably. The updates make it very efficient to load, run, and fine-tune the models.

In this blog post, we talk about all of the upgrades in depth, and how they become part of the transformers toolkit so other models (current and future) can benefit from them. Providing clean implementations of new methods in transformers also allows the community to quickly understand and adopt them. Frameworks such as MLX, llama.cpp or vLLM can use the transformers code as a reference to build their own implementations.

For this release, we worked on:

  • Zero-build kernels, downloadable from the Hub
  • MXFP4 quantization
  • Tensor parallelism and expert parallelism
  • Dynamic sliding window layers and cache
  • Continuous batching and paged attention
  • Faster model loading

Best part: most of these features should work across all major models in transformers!



Zero-build Kernels, downloadable from the Hub

A kernel is a specialized, compact program that runs on accelerators to execute tasks like matrix multiplications, activations, or normalizations. In eager PyTorch, operations trigger individual kernels sequentially, which is simple but can incur extra memory transfers and launch overheads. PyTorch 2.0’s torch.compile with backends like TorchInductor addresses this by automatically fusing and optimizing kernels, delivering 2–10× performance gains.
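As a quick illustration of that difference, here is a minimal sketch (not part of the original release notes) that compiles a small RMSNorm-style function so TorchInductor can fuse its elementwise operations into fewer kernels:

import torch

def rmsnorm(x, weight, eps=1e-6):
    # in eager mode, each of these small ops launches its own kernel
    variance = x.pow(2).mean(-1, keepdim=True)
    return weight * x * torch.rsqrt(variance + eps)

# torch.compile traces the function and lets TorchInductor fuse the ops
compiled_rmsnorm = torch.compile(rmsnorm)

x = torch.randn(4, 4096)
w = torch.ones(4096)
print(torch.allclose(rmsnorm(x, w), compiled_rmsnorm(x, w), atol=1e-5))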

In addition, the community has created custom kernels for frequent combinations of operations, not just individual PyTorch ops like matmul. For example, Flash Attention was created to optimize the critical attention block that defines the transformer architecture, and is present in many models including most LLMs. By carefully combining all the attention operations inside a single kernel, memory transfers are minimized, memory use is reduced, and speedups can be achieved.

The issue is that all these various kernels live in separate libraries, which would create dependency bloat if they were added to the transformers library. Moreover, these kernels are not just Python code: they consist of low-level CUDA code, glued together with C++ and exposed through a Python layer. This means they have to be compiled on the target system, which in turn requires whatever build system each kernel library needs.

The kernels package solves this problem by downloading pre-built binaries of supported kernels from the Hub. You simply indicate the kernel you want to use, and kernels will look for a version compatible with your system and download it on first use.
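To get a feel for the standalone API, here is a sketch along the lines of the kernels documentation; it assumes a CUDA device and uses the kernels-community/activation repository as an example:

import torch
from kernels import get_kernel

# Download a pre-built, optimized kernel from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")

x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)

# Call into the downloaded kernel (writes the result into y)
activation.gelu_fast(y, x)
print(y)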



Custom Kernels for GPT-OSS

GPT-OSS, a Mixture of Experts (MoE) model, is a big user of kernels from the Hub. It leverages several custom kernels:

  1. Liger RMSNorm, used via `@use_kernel_forward_from_hub("RMSNorm")`
  2. Megablocks MoE kernels, used via `@use_kernel_forward_from_hub("MegaBlocksMoeMLP")`
  3. Flash Attention 3 with support for attention sinks.
  4. MXFP4 triton kernels (covered later)

Let’s take a look at the first two.

Behind the scenes, the decorators (1 and 2) simply point to community-contributed kernels. For example, RMSNorm comes from liger_kernels, while the MegaBlocksMoeMLP kernel comes from megablocks. Depending on your device (CUDA or ROCm) and whether you’re training or running inference, the right kernel is pulled in automatically.

This design is both specific and general: the liger RMSNorm kernel is already being reused across multiple models, and the MoE kernel could be applied to future MoEs as well.
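To give an idea of what this looks like on the modeling side, here is a minimal sketch (the class name MyRMSNorm is purely illustrative) of how a layer is registered with the decorator so its forward can be swapped for a Hub kernel at runtime:

import torch
from torch import nn
from kernels import use_kernel_forward_from_hub

# Registering the layer under the "RMSNorm" name lets the kernels machinery
# replace its forward with the community implementation when kernels are enabled.
@use_kernel_forward_from_hub("RMSNorm")
class MyRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        # reference eager implementation, used when no kernel is available
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states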

Because kernels pulls code from the Hub, you have to opt in to this feature by passing use_kernels=True when instantiating your model, as shown below. We enable INFO logging in the example so you can easily confirm that downloadable kernels are in use.

These kernels are not compatible with mxfp4, so inference will run in bfloat16 if you use them. Please benchmark your system to find the best combination of memory and throughput for your project!

from transformers import AutoTokenizer, AutoModelForCausalLM

import logging
logging.basicConfig(level=logging.INFO)

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
    use_kernels=True,
)

Running a quick generation yields log messages like:

INFO:root:Using layer `LigerRMSNorm` from repo `kernels-community/liger_kernels`
INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks`

Figure 1 shows that, on the system we tested, these kernels work best for larger batch sizes. We always recommend benchmarking any performance-related changes as close to your production conditions as possible.

benchmark with and without kernels
Figure 1: Benchmarking results of custom kernels

You can explore and play with the benchmarking script here.



Flash Attention 3

OpenAI gpt-oss models use attention sinks, which improve quality and facilitate the use of longer contexts. The vLLM team added this feature to the latest version of Flash Attention (Flash Attention 3), and the resulting custom kernel is available on the Hub. Currently, this kernel is compatible with the Hopper architecture. If you have one, this is the way to enable it:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
+    # Flash Attention with Sinks
+    attn_implementation="kernels-community/vllm-flash-attn3",
)



MXFP4 Quantization

Large language models are memory-hungry. Quantization reduces memory footprint by storing weights (and sometimes activations) in lower-precision formats. For reference, FP32 uses 32 bits per number and BF16 uses 16. By reducing bit width, we trade some precision for smaller models and faster memory movement.
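As a rough back-of-envelope (weights only, ignoring activations and the KV cache, and assuming an 8-bit shared scale per 32-element block for MXFP4), the savings look like this:

params = 20_000_000_000  # roughly gpt-oss-20b

def gib(num_bytes):
    return num_bytes / 2**30

print(f"FP32 : {gib(params * 4):6.1f} GiB")                  # 32 bits per weight
print(f"BF16 : {gib(params * 2):6.1f} GiB")                  # 16 bits per weight
print(f"MXFP4: {gib(params * (4 + 8 / 32) / 8):6.1f} GiB")   # 4 bits + shared block scale

Real checkpoints quantize only some layers, so actual footprints are somewhat higher than the MXFP4 line suggests.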

If you want a visual primer on quantization trade-offs, Maarten Grootendorst’s article is great: A Visual Guide to Quantization.



What is MXFP4

explanation of mxfp4 format
Figure 2: The E2M1 format used in MXFP4

MXFP4 is a 4-bit floating-point format with an E2M1 layout: 1 sign bit, 2 exponent bits, and 1 mantissa bit, as shown in Figure 2. By itself, E2M1 is very coarse. MXFP4 compensates with blockwise scaling:

  • Vectors are grouped into blocks of 32 elements.
  • Each block stores a shared scale that restores dynamic range when dequantizing.
  • Inside each block, 4-bit values represent numbers relative to that scale.

This blockwise scheme lets MXFP4 keep range while using very few bits. In practice, GPT-OSS 20B fits in roughly 16 GB of VRAM and GPT-OSS 120B fits in roughly 80 GB when MXFP4 is active, which is the difference between “cannot load” and “can run on a single GPU.” The catch is that matrix multiplies now need to respect the block scales. Doing this efficiently at scale requires dedicated kernels.
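To make the blockwise scheme concrete, here is a toy sketch (not the production kernel path) that quantizes a single 32-element block to the nearest E2M1 value under a shared scale. Real MXFP4 additionally constrains the scale to a power of two (E8M0); we use an unconstrained float scale here for simplicity.

import torch

# Representable E2M1 magnitudes; the sign bit is handled separately.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    # shared per-block scale so the largest element maps onto the largest code
    scale = block.abs().max().clamp(min=1e-12) / E2M1_GRID.max()
    scaled = block / scale
    # snap each element to the nearest representable magnitude, keep the sign
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    codes = E2M1_GRID[idx] * torch.sign(scaled)
    return codes, scale

def dequantize_block(codes, scale):
    return codes * scale

block = torch.randn(32)
codes, scale = quantize_block(block)
print("max abs error:", (block - dequantize_block(codes, scale)).abs().max().item())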



MXFP4 in transformers

transformers now includes native support for MXFP4, leveraging optimized Triton MXFP4 kernels for better performance. This builds on the community-driven kernel distribution discussed earlier, using pre-compiled kernels from the Hub to simplify deployment.

Key implementation details:

  • Quantizer logic: found in the MXFP4 quantizer file, this handles the core quantization process for MXFP4.
  • Integration hooks: the MXFP4 integration file enables seamless use of MXFP4 within the transformers framework.

To check whether a model supports MXFP4, inspect its configuration:

from transformers import GptOssConfig

model_id = "openai/gpt-oss-120b"
cfg = GptOssConfig.from_pretrained(model_id)
print(cfg.quantization_config)

If 'quant_method': 'mxfp4' is present, the model will automatically use the MXFP4 pathway with Triton kernels when supported.

Thanks to this pull request, you can fine-tune gpt-oss models and save them directly to the Hub in MXFP4 format, streamlining deployment with optimized performance.



Requirements and fallbacks

To run MXFP4 on GPU you need:

  1. accelerate, kernels, and triton>=3.4 installed. Note that PyTorch 2.8 already comes with triton 3.4, so you only need to install triton manually if you are using PyTorch 2.7.
  2. An NVIDIA GPU with compute capability ≥ 7.5. This goes all the way back to the T4, so you can run gpt-oss-20b on the free tiers of Google Colab and Kaggle, and on many consumer GPUs.

If these constraints are not met, transformers falls back to a higher-precision path (bfloat16 by default), which requires about 4× the memory of MXFP4.

The snippet below loads GPT-OSS twice on CUDA: once with Mxfp4Config(dequantize=True) (memory intensive) and once with the default quantized path (memory efficient). Figure 3 shows the amount of VRAM used after each load so you can visualize the savings.
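Here is a minimal sketch of that comparison, assuming gpt-oss-20b on a single CUDA device and using torch.cuda.memory_allocated as the probe:

import torch
from transformers import AutoModelForCausalLM, Mxfp4Config

model_id = "openai/gpt-oss-20b"

def report(tag):
    print(f"{tag}: {torch.cuda.memory_allocated() / 2**30:.1f} GiB allocated")

# Dequantized path: MXFP4 weights are expanded to bfloat16 (memory intensive)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=Mxfp4Config(dequantize=True),
    dtype="auto",
    device_map="cuda",
)
report("dequantized")

del model
torch.cuda.empty_cache()

# Default path: weights stay in MXFP4 (memory efficient)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="cuda")
report("quantized")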

memory used with quantized vs dequantized models
Figure 3: Memory requirements for the quantized and dequantized models



Kernels for MXFP4

Efficient MXFP4 requires kernels that understand 32-element blocks and their scales during GEMMs and fused ops. This is where Kernels from the Hub comes in again. transformers automatically pulls in the MXFP4-aware Triton kernels from the community repository when you load a model that needs them. The repository will appear in your local cache and will be used during the forward pass. For the MXFP4 kernels you don’t need to pass use_kernels=True as before; this path is enabled by default in transformers.

Quick sanity check with the Hugging Face cache CLI, after running gpt-oss-20b on a GPU compatible with the triton MXFP4 kernels:

hf cache scan

Sample output:

REPO ID                          REPO TYPE SIZE ON DISK
-------------------------------- --------- ------------
kernels-community/triton_kernels model           536.2K
openai/gpt-oss-20b               model            13.8G

This means the MXFP4 kernels were fetched and are available for execution.

Let’s run some benchmarks and see how well the MXFP4 kernels perform. In Figure 4, we see that the MXFP4 kernels perform even better than the custom MoE and RMSNorm kernels for larger batches.

benchmark mxfp4 kernels
Figure 4: MXFP4 kernel benchmark

You can explore and play with the benchmarking script here.



Tensor Parallelism

explaining tensor parallelism
Figure 5: Explanation of tensor parallelism.

Tensor Parallelism (TP) splits tensors within a layer across multiple GPUs (as shown in Figure 5). Each GPU multiplies its shard in parallel, and then the partial results are combined using all-gather or all-reduce operations.
This reduces per-GPU memory and keeps all GPUs working on the same layer, which improves throughput as sequence length or batch size grows. TP is communication-intensive and usually works best on a single machine with fast intra-node links.
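To see why this works, here is a single-process toy sketch of a column-parallel matmul: each simulated rank multiplies its weight shard, and an all-gather (here just a concatenation) reconstructs the full output:

import torch

torch.manual_seed(0)
x = torch.randn(2, 8)    # (batch, hidden)
w = torch.randn(8, 16)   # full weight (hidden, out)

world_size = 2
shards = w.chunk(world_size, dim=1)                # each rank holds one shard

partial_outputs = [x @ shard for shard in shards]  # computed in parallel, one per rank
y_tp = torch.cat(partial_outputs, dim=-1)          # all-gather along the output dim

print(torch.allclose(y_tp, x @ w, atol=1e-6))      # matches the unsharded result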



What this enables in transformers

transformers implements TP directly in from_pretrained. You can start with the predefined plan:


import torch
from transformers import PreTrainedTokenizerFast, GptOssForCausalLM

model_id = "openai/gpt-oss-120b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
model = GptOssForCausalLM.from_pretrained(
    model_id,
    tp_plan="auto", 
    dtype="auto",
).eval()

messages = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Explain KV caching briefly."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to(model.device)

with torch.inference_mode():
    generations = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(generations[0][inputs["input_ids"].shape[-1]:]))

If you don’t have the infrastructure to run the above, you can just spawn a job on our GPUs using Hugging Face Jobs!

hf jobs run --detach --flavor l4x4 ghcr.io/astral-sh/uv:debian /bin/bash -c \
  "uv venv .venv --python 3.12 && 
  source .venv/bin/activate && 
  uv pip install --upgrade torch numpy transformers accelerate triton kernels && 
  wget https://huggingface.co/datasets/ariG23498/distributed/raw/main/tp_gpt_oss.py && 
  torchrun --nproc-per-node=4 tp_gpt_oss.py"

hf jobs is available to all Hugging Face PRO & Enterprise users.

Under the hood, tp_plan="auto" selects a predefined sharding recipe for each layer and wires up the necessary collectives. You can inspect the active plan with print(model._tp_plan) if you want to confirm what’s being sharded.



When to reach for TP

Use TP when the model is too large for one GPU and you need parallel compute, not just memory placement. TP tends to scale throughput with more GPUs, especially for long sequences or larger batches.

If you are curious about how TP differs from device_map="auto" (memory placement), this short Stack Overflow answer explains the distinction and when to use each.

To learn more about TP, here are two must-read resources:



Expert Parallelism

Expert Parallelism (EP) shards the experts inside MoE layers across GPUs. Each token is routed to one or a few experts, so only those experts run their feed-forward pass. Since experts are independent MLPs, we can place different experts on different ranks and exchange only the hidden states for the routed tokens. This keeps the matrix multiplies intact on each rank and replaces tensor slicing with routing and collectives.
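Here is a toy, single-process sketch of that routing idea (top-1 routing, tiny linear “experts”); in real EP the experts are sharded across ranks and the routed hidden states are exchanged with collectives:

import torch
from torch import nn

torch.manual_seed(0)
num_experts, hidden = 4, 8
tokens = torch.randn(6, hidden)

# One tiny expert MLP each; in real EP each rank would own a subset of these.
experts = [nn.Linear(hidden, hidden) for _ in range(num_experts)]

# The router picks one expert per token (top-1 for simplicity).
assignment = torch.randn(tokens.shape[0], num_experts).argmax(dim=-1)

output = torch.zeros_like(tokens)
with torch.no_grad():
    for expert_id, expert in enumerate(experts):
        mask = assignment == expert_id
        if mask.any():
            # only the routed tokens' hidden states reach this expert
            output[mask] = expert(tokens[mask])

print(output.shape)  # (6, 8): every token processed by exactly one expert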

Run with multiple processes using torchrun. EP is enabled via the distributed configuration and works with GPT-OSS MoE layers out of the box in transformers.


import torch
from transformers import PreTrainedTokenizerFast, GptOssForCausalLM
from transformers.distributed import DistributedConfig

model_id = "openai/gpt-oss-120b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
model = GptOssForCausalLM.from_pretrained(
    model_id,
    distributed_config=DistributedConfig(enable_expert_parallel=True), 
    dtype="auto",
).eval()

messages = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Explain KV caching briefly."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to(model.device)

with torch.inference_mode():
    generations = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(generations[0][inputs["input_ids"].shape[-1]:]))

Here is how you would run it using hf jobs:

hf jobs run --detach --flavor l4x4 ghcr.io/astral-sh/uv:debian /bin/bash -c \
  "uv venv .venv --python 3.12 && 
  source .venv/bin/activate && 
  uv pip install --upgrade torch numpy transformers accelerate triton kernels && 
  wget https://huggingface.co/datasets/ariG23498/distributed/raw/main/ep_gpt_oss.py && 
  torchrun --nproc-per-node=4 ep_gpt_oss.py"

When you enable Expert Parallelism, Tensor Parallelism is also activated. This means you get the best of both worlds!



Dynamic Sliding Window Layer & Cache

Many recent LLMs use sliding window attention, or a mixture of sliding and global attention layers, as a way to save memory and reduce the expensive quadratic attention computations that grow with sequence length. However, the dynamic KV cache implementation in transformers used to keep allocating space according to the sequence length, without looking at the individual attention layers. You could always optimize memory using compilation (that is, fixed shapes), but that is a separate scenario altogether.

transformers now has a DynamicSlidingWindowLayer and a config‑aware DynamicCache. If the model config declares sliding‑window or hybrid attention (both sliding and global attention layers are used), the cache stops growing past the window for the sliding layers. If you don’t pass the config, behavior stays as before (full, ever‑growing KV cache as sequence length grows).

For models that only use sliding window layers, such as Mistral 7B, cache memory stops growing when the sequence reaches the window size (4096, in this case). This makes sense, since the sliding layers cannot attend past the previous 4K tokens anyway.

mistral cache behaviour comparison

OpenAI gpt-oss alternates between sliding and global attention layers, which results in total KV cache memory being roughly halved, as we’ll see, as sequence length increases.
This gives us:

  • Much lower KV‑cache memory for models with sliding or hybrid attention (e.g. GPT‑OSS). Cache growth plateaus once the window is reached (e.g., 4K for Mistral; 128 for GPT‑OSS sliding layers), instead of scaling linearly with the total number of generated tokens; see the sketch after this list. (GitHub, Transformers)
  • Speed/latency wins on long prompts and long generations: smaller KV tensors mean lighter attention reads/writes and less memory-bandwidth pressure, especially after the window is hit. (This is the central motivation behind sliding‑window/hybrid LLMs.) (AI21, vLLM Blog)
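Here is a back-of-envelope sketch of that plateau for a single attention layer (the head count and head dimension are illustrative, not the exact Mistral or GPT-OSS configs):

def kv_mib_per_layer(seq_len, window=None, kv_heads=8, head_dim=128, dtype_bytes=2):
    # a sliding-window layer never caches more than `window` positions
    cached = seq_len if window is None else min(seq_len, window)
    return 2 * cached * kv_heads * head_dim * dtype_bytes / 2**20  # 2 = keys + values

for seq_len in (1_000, 4_096, 32_000, 128_000):
    print(
        f"{seq_len:>7} tokens | global: {kv_mib_per_layer(seq_len):8.1f} MiB"
        f" | sliding(4K): {kv_mib_per_layer(seq_len, window=4_096):7.1f} MiB"
        f" | sliding(128): {kv_mib_per_layer(seq_len, window=128):5.1f} MiB"
    )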



How to use it

The optimized cache is enabled by default, so you don’t have to make any changes to your existing code. If you want to create the DynamicCache explicitly, here is how you would do it:

from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
).eval()

messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to(model.device)

cache = DynamicCache(config=model.config) 

generated = model.generate(
    **inputs,
    max_new_tokens=500,
    past_key_values=cache
)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Figure 6 shows how much of a difference using the dynamic KV cache with sliding window attention makes.

sliding window cache
Figure 6: The memory evaluation of dynamic cache with sliding window attention



Continuous Batching & Paged Attention

A typical autoregressive generation process looks like Figure 7. You feed in the prefill tokens, and the model predicts each new token one after the other until it predicts the EOS (End of Sequence) token.

prefilling
Figure 7: Autoregressive token generation

Let’s see what the generation process looks like when we pass a batch of inputs. In Figure 8 you can see that some generations finish earlier than others. This mismatch of lengths underutilizes the GPU.

static batching
Figure 8: Static batching of sequences

This type of batching is called static batching. While it is simple and easy to understand, it inherently comes with inefficiencies: only after every sequence is fully generated can we move on to the next batch.

To get around this issue, we use dynamic batching (also known as continuous batching). Instead of waiting for all generations in a batch to finish, we schedule incoming requests into the finished slots. That way, as soon as a generation in the batch is complete, we prefill that slot with the next request. The process looks like Figure 9.

continuous batching
Figure 9: Continuous Batching of sequences

Transformers supports continuous batching with the generate_batch API. It isn’t meant for production-grade model serving (frameworks like vLLM and SGLang are great at that), but it can be very helpful for evaluation and experimentation. Here is an example script that runs CB end to end on Qwen/Qwen3-4B-Instruct-2507, and a condensed sketch is shown below.
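Here is a condensed sketch of how such a script can use generate_batch; the exact argument and output field names may differ slightly between transformers versions, so treat it as a starting point rather than a reference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="cuda").eval()

prompts = ["What is KV caching?", "Explain MoE routing in one sentence."]
inputs = [tokenizer(p).input_ids for p in prompts]   # one token-id list per request

# Requests are scheduled continuously: finished slots are refilled with pending ones.
batch_outputs = model.generate_batch(
    inputs=inputs,
    generation_config=GenerationConfig(max_new_tokens=64, do_sample=False),
)

for request_id, output in batch_outputs.items():
    print(request_id, tokenizer.decode(output.generated_tokens, skip_special_tokens=True))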

We also ran a benchmark comparing Continuous Batching and Static Batching with 100 samples. In Figure 10, we see that CB is noticeably faster than SB.

Figure 10: Continuous vs Static Batching Tokens/Second

You can play around with the benchmarks here: SB, CB



Load larger models faster

When you load a large model onto your GPU, PyTorch needs to reserve GPU memory for each layer’s weights. Each of those requests (per layer) takes time, and for multi-billion-parameter models it can mean thousands of tiny memory allocations, adding up to a long wait before the model is ready. Instead of asking the GPU for new memory each time, PyTorch can hold on to one big chunk and then hand out slices from it quickly.

PyTorch’s caching allocator can do exactly this. The catch is that the allocator only gets fast after you’ve given it some memory to work with. If you don’t “stock the pantry” first, you still end up making many slow trips to the market. This PR (🎉 #36380) taught transformers to pre-stock the pantry before it starts copying model weights.

It:

  • Looks at the device_map (where each layer will live).
  • Pre-allocates a large enough block on each GPU.
  • Then, as layers are copied in, they simply slot neatly into this pre-reserved space.

You don’t have to make any changes to your existing code, as this is the default behaviour in transformers. If you use device_map="auto" or provide your own device map, your model will now load faster automatically. If you’re running with Tensor Parallel (tp_plan="auto") and torchrun, you also benefit from companion changes that make multi-GPU loading smarter.
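Here is a toy illustration of the caching-allocator behaviour this relies on (the sizes are arbitrary and a CUDA GPU is required): after one big reservation, later allocations are carved out of the already-reserved segment instead of triggering new requests to the driver.

import torch

torch.cuda.empty_cache()

# "Stock the pantry": one big reservation up front (2 GiB)
big = torch.empty(2 * 2**30, dtype=torch.uint8, device="cuda")
del big  # freed for reuse, but the allocator keeps the segment reserved

print(f"reserved : {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")

# These smaller allocations are served from the reserved segment
tensors = [torch.empty(64 * 2**20, dtype=torch.uint8, device="cuda") for _ in range(8)]
print(f"allocated after small tensors: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")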



Conclusion

transformers moves quickly and is community-first. The library evolves at the pace of the field because contributors shape it in the open. Pieces added for new models become part of the toolkit and are reused in future integrations.

This velocity enables day-zero integrations like the GPT-OSS series. As the stack becomes increasingly PyTorch-first, it trims bloat and doubles down on the PyTorch paths that matter in practice. The result is a cleaner core that unlocks new capabilities through community kernels, quantization, and parallelism plans, while also standardizing model definitions so that architectures supported in transformers serve as a reference across the broader ecosystem.

This post is a one-time snapshot of a process we continuously iterate on in the same direction: serving the needs of the community. To stay up to date with the latest additions to transformers, check the docs and release notes. And please keep sharing your feedback and releasing your models in transformers for the community to enjoy 🤗



Read More

If you want to go deeper into particular topics, here is a list of links worth visiting:

  1. Hugging Face GPT-OSS Recipes Repository
  2. Welcome GPT OSS: OpenAI’s Latest Open-Source Model Family
  3. OpenAI Cookbook: GPT-OSS Topic
  4. Transformers Documentation: Distributed Inference on Multiple GPUs
  5. Matthew Carrigan’s X Thread on GPT OSS Innovations
  6. YouTube Video: OpenAI GPT OSS Announcement
  7. Transformers PR #36380: Faster Model Loading on Accelerators
  8. Transformers PR #36335: Update from_pretrained for Tensor Parallelism
  9. Transformers PR #40039: New Dynamic Sliding Window Layer and Cache
  10. HAN Lab Blog: How Attention Sinks Keep Language Models Stable


